[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173681#comment-13173681 ]

Hadoop QA commented on HBASE-5078:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508136/5078.txt
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
       Please justify why no new tests are needed for this patch. Also please list what
       manual steps were performed to verify this patch.

    -1 javadoc. The javadoc tool appears to have generated -152 warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 75 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed these unit tests:
       org.apache.hadoop.hbase.coprocessor.TestCoprocessorEndpoint

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/556//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/556//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/556//console

This message is automatically generated.
DistributedLogSplitter failing to split file because it has edits for lots of regions
-------------------------------------------------------------------------------------

             Key: HBASE-5078
             URL: https://issues.apache.org/jira/browse/HBASE-5078
         Project: HBase
      Issue Type: Bug
Affects Versions: 0.92.0
        Reporter: stack
        Assignee: stack
        Priority: Critical
         Fix For: 0.92.0
     Attachments: 5078-v2.txt, 5078-v3.txt, 5078.txt

Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions, and just opening a file per region was taking so long that we were never updating our progress, so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take to acquire the task. First, here is the master's view:

{code}
2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0
...
2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664
...
2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3
{code}

Master then gives it elsewhere. Over on the regionserver we see:

{code}
2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679
2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false
{code}

and so on till:

{code}
2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678
2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679
{code}

When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits.
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173685#comment-13173685 ]

Hadoop QA commented on HBASE-5077:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508163/HBASE-5077.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
       Please justify why no new tests are needed for this patch. Also please list what
       manual steps were performed to verify this patch.

    -1 javadoc. The javadoc tool appears to have generated -152 warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed these unit tests:
       org.apache.hadoop.hbase.io.TestHeapSize

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/558//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/558//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/558//console

This message is automatically generated.
SplitLogWorker fails to let go of a task, kills the RS
------------------------------------------------------

             Key: HBASE-5077
             URL: https://issues.apache.org/jira/browse/HBASE-5077
         Project: HBase
      Issue Type: Bug
Affects Versions: 0.92.0
        Reporter: Jean-Daniel Cryans
        Assignee: Jean-Daniel Cryans
        Priority: Critical
         Fix For: 0.92.1
     Attachments: HBASE-5077.patch

I hope I didn't break the spacetime continuum; I got this while testing 0.92.0:

{quote}
2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
        at java.lang.Thread.run(Thread.java:662)
{quote}

I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit, and one of the region servers died on this stack trace while the others were able to interrupt themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
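The direction a fix for the above would take can be illustrated with a small stand-in: when the task znode has already been removed (e.g. the master preempted the task and handed it to another worker), the NoNode failure on endTask is a benign race, not a fatal logic error that should kill the region server. This is a hypothetical sketch, not the actual SplitLogWorker code; the TaskStore interface and NoNodeException stand-in are invented here to keep the example self-contained, in place of the real ZooKeeper API.

```java
// Hypothetical sketch of defensive task completion. All names here are
// stand-ins for illustration, not the real HBase/ZooKeeper classes.
public class EndTaskSketch {

    /** Stand-in for ZooKeeper's NoNodeException. */
    public static class NoNodeException extends Exception {}

    /** Stand-in for the znode store the worker writes task state to. */
    public interface TaskStore {
        void setData(String path, byte[] state) throws NoNodeException;
    }

    /**
     * Marks the task done. Returns true on success, false if the task
     * znode vanished (preempted by the master) -- in which case the
     * worker should log a warning and move on rather than abort.
     */
    public static boolean endTask(TaskStore store, String path, byte[] doneState) {
        try {
            store.setData(path, doneState);
            return true;
        } catch (NoNodeException e) {
            // Task znode already deleted: another worker owns it now.
            return false;
        }
    }
}
```

A worker built this way survives the preemption race: the missing znode merely means the task is no longer ours.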
[jira] [Commented] (HBASE-4916) LoadTest MR Job
[ https://issues.apache.org/jira/browse/HBASE-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173687#comment-13173687 ]

Phabricator commented on HBASE-4916:
------------------------------------

mbautin has commented on the revision "HBASE-4916 [jira] LoadTest MR Job".

  Added CCs: jdcryans, stack, tedyu, todd

  @stack, @tedyu, @todd, @jdcryans: this is a versatile map-reduce-based load tester that our intern Christopher Gist wrote. While there might be some overlap with PerformanceEvaluation and LoadTestTool, this load tester was developed and debugged independently and has already been useful to us in real-world tasks. Could you please take a quick look and let us know what you think is the best strategy for integrating this into trunk?

REVISION DETAIL
  https://reviews.facebook.net/D741

LoadTest MR Job
---------------

             Key: HBASE-4916
             URL: https://issues.apache.org/jira/browse/HBASE-4916
         Project: HBase
      Issue Type: Sub-task
      Components: client, regionserver
        Reporter: Nicolas Spiegelberg
        Assignee: Christopher Gist
         Fix For: 0.94.0
     Attachments: HBASE-4916.D741.1.patch

Add a script to start a streaming map-reduce job where each map task runs an instance of the load tester for a partition of the key space. Ensure that the load tester takes a parameter indicating the start key for write operations.
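As a rough illustration of the partitioning the description calls for, each map task can derive its start key from its task index over an evenly split key space. The method name and the even-split scheme below are assumptions for illustration, not the tool's actual interface.

```java
// Illustrative sketch: compute the write start key for each map task when
// a key space of `totalKeys` keys is split across `numTasks` load-tester
// instances. Names and scheme are assumed, not taken from HBASE-4916.
public class KeyPartitioner {

    /** Start key for each of numTasks map tasks over keys [0, totalKeys). */
    public static long[] startKeys(long totalKeys, int numTasks) {
        long[] starts = new long[numTasks];
        long span = totalKeys / numTasks;  // last task also absorbs any remainder
        for (int i = 0; i < numTasks; i++) {
            starts[i] = i * span;
        }
        return starts;
    }
}
```

Each mapper would then be launched with its element of this array as the start-key parameter.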
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Status: Open  (was: Patch Available)
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Attachment: HBASE-5077-v2.patch

Forgot about the finally block; changing it again to basically add the correct return and print if process_failed is false.
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173689#comment-13173689 ]

Hadoop QA commented on HBASE-5078:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12508165/5078-v3.txt
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
       Please justify why no new tests are needed for this patch. Also please list what
       manual steps were performed to verify this patch.

    -1 javadoc. The javadoc tool appears to have generated -152 warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 75 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed these unit tests:
       org.apache.hadoop.hbase.io.TestHeapSize

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/559//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/559//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/559//console

This message is automatically generated.
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-5077:
--------------------------------------

    Fix Version/s:     (was: 0.92.1)
                   0.92.0
           Status: Patch Available  (was: Open)
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173691#comment-13173691 ]

Hudson commented on HBASE-5021:
-------------------------------

Integrated in HBase-TRUNK #2564 (See [https://builds.apache.org/job/HBase-TRUNK/2564/])

[jira] [HBase-5021] Enforce upper bound on timestamp

Summary: We have been getting hit with performance problems on the ODS side due to invalid timestamps being inserted by the timestamp. ODS is working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Therefore, we should also allow the option to check the upper bound in HBase. This is the first draft. Probably should allow per-CF customization.

Test Plan:
- mvn test -Dtest=TestHRegion#testPutWithTsTooNew

Reviewers: Kannan, Liyin, JIRA
CC: stack, nspiegelberg, tedyu, Kannan, mbautin
Differential Revision: 849

Enforce upper bound on timestamp
--------------------------------

             Key: HBASE-5021
             URL: https://issues.apache.org/jira/browse/HBASE-5021
         Project: HBase
      Issue Type: Improvement
        Reporter: Nicolas Spiegelberg
        Assignee: Nicolas Spiegelberg
        Priority: Critical
         Fix For: 0.94.0
     Attachments: D849.1.patch, D849.2.patch, D849.3.patch, HBASE-5021-trunk.patch

We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted by the timestamp. We are working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Since timestamps are a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity-check the upper bound on the server side in HBase.
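The server-side check HBASE-5021 proposes can be sketched as a simple guard that rejects puts whose timestamps lie too far in the future. This is a minimal illustration only: the skew allowance, method names, and use of IllegalArgumentException are assumptions here, not the committed HBase implementation.

```java
// Hypothetical sketch of an upper-bound timestamp sanity check.
// The slack value and exception type are illustrative assumptions.
public class TimestampGuard {

    private final long maxClockSkewMs;  // how far past "now" a timestamp may be

    public TimestampGuard(long maxClockSkewMs) {
        this.maxClockSkewMs = maxClockSkewMs;
    }

    /** Rejects a put timestamp beyond now + allowed skew. */
    public void check(long ts, long nowMs) {
        if (ts > nowMs + maxClockSkewMs) {
            throw new IllegalArgumentException(
                "Timestamp " + ts + " is too far in the future (now=" + nowMs + ")");
        }
    }
}
```

Per-column-family customization, as the summary suggests, would amount to one such guard (or slack value) per family.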
[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan
[ https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173697#comment-13173697 ]

Shrijeet Paliwal commented on HBASE-5035:
-----------------------------------------

Ted, you had mentioned the following in the email thread:

    Null check for regionInfo should be added

I could not gather why regionInfo could possibly be null. The call 'Writables.getHRegionInfo(value);' does not seem to ever return null. Could you please explain your reasoning? Meanwhile I am still reading the code and trying to find the place where the NPE might occur.

Runtime exceptions during meta scan
-----------------------------------

             Key: HBASE-5035
             URL: https://issues.apache.org/jira/browse/HBASE-5035
         Project: HBase
      Issue Type: Bug
      Components: client
Affects Versions: 0.90.3
        Reporter: Shrijeet Paliwal
         Version: 0.90.3 + patches back ported

The other day our client started spitting these two runtime exceptions. Not all clients connected to the cluster were affected -- only 4 of them. While 3 of them were throwing NPE, one of them was throwing ArrayIndexOutOfBoundsException. The errors are:

1. http://pastie.org/2987926
2. http://pastie.org/2987927

Clients did not recover from this and I had to restart them. The motive of this jira is to identify and put null checks at appropriate places. Also, with the given stack trace I cannot tell which line caused the NPE or AIOBE, hence an additional motive is to make the trace more helpful.
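A defensive version of the null check under discussion can be sketched as guarding the cell value before it is handed to the deserializer, so a missing or empty info:regioninfo cell is skipped with a warning instead of producing an NPE deep inside the scan. The class and method below are stand-ins for illustration, not the 0.90 client API.

```java
// Hypothetical guard for meta-scan robustness: validate the raw cell
// bytes before deserialization. Stand-in names, not real HBase code.
public class MetaRowGuard {

    /** Returns true when the cell value can safely go to the deserializer. */
    public static boolean isParseable(byte[] regionInfoValue) {
        // A null or empty value means a corrupt/partial meta row; the
        // caller should log and skip it rather than let an NPE propagate.
        return regionInfoValue != null && regionInfoValue.length > 0;
    }
}
```

Even if the deserializer itself never returns null, the bytes it is fed can be absent, which is where a check like this earns its keep.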
[jira] [Reopened] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu reopened HBASE-5021:
-------------------------------

TestHeapSize fails on Jenkins
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173706#comment-13173706 ]

Phabricator commented on HBASE-5021:
------------------------------------

mbautin has commented on the revision "[jira] [HBase-5021] Enforce upper bound on timestamp".

  This adds a new field to HRegion but does not update the heap size (FIXED_OVERHEAD). I think the unit test still passes because of alignment, but I noticed this problem with the data block encoding patch applied. Let's see if TestHeapSize fails on Jenkins.

REVISION DETAIL
  https://reviews.facebook.net/D849

COMMIT
  https://reviews.facebook.net/rHBASE1221532
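The alignment effect described in that comment can be made concrete: heap-size estimates round object sizes up to an 8-byte boundary, so a small added field can hide in the padding and leave the rounded total unchanged, which is why TestHeapSize may keep passing until another patch shifts the layout. A minimal sketch of the rounding, assuming the usual HotSpot 8-byte alignment:

```java
// Sketch of 8-byte object alignment as used by heap-size estimates.
// The alignment constant is the common HotSpot default, assumed here.
public class HeapAlign {

    /** Round an object size up to the JVM's 8-byte alignment boundary. */
    public static long align(long size) {
        return (size + 7) / 8 * 8;
    }
}
```

For example, an object whose fields sum to 20 bytes and one whose fields sum to 24 bytes both align to 24; a 4-byte field added to the former is invisible to the rounded estimate.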
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173703#comment-13173703 ]

Zhihong Yu commented on HBASE-5078:
-----------------------------------

+1 on patch v3. Minor comment: openedNewFile sounds like a boolean. Would numNewlyOpenedFiles be a better name?

DistributedLogSplitter failing to split file because it has edits for lots of regions
-------------------------------------------------------------------------------------

             Key: HBASE-5078
             URL: https://issues.apache.org/jira/browse/HBASE-5078
         Project: HBase
      Issue Type: Bug
Affects Versions: 0.92.0
        Reporter: stack
        Assignee: stack
        Priority: Critical
         Fix For: 0.92.0
     Attachments: 5078-v2.txt, 5078-v3.txt, 5078.txt

Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions, and just opening a file per region was taking so long that we were never updating our progress, so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take to acquire the task. First, here is the master's view:

{code}
2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0
...
2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664
...
2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3
{code}

Master then gives it elsewhere. Over on the regionserver we see:

{code}
2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679
2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false
{code}

and so on till:

{code}
2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678
2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679
{code}

When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits.
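One way to avoid the starvation described above is to heartbeat on a time interval rather than only every N edits, so that a long stall opening recovered-edits files still produces progress reports within the task-acquisition timeout. The class below is a hypothetical sketch of that idea, not the attached patch; the interval value and names are assumptions.

```java
// Hypothetical sketch: time-based progress reporting for a log-split
// worker. Report whenever intervalMs has elapsed, regardless of how few
// edits (or how many slow file opens) happened in between.
public class ProgressReporter {

    private final long intervalMs;   // e.g. well under the 25s acquisition timeout
    private long lastReportMs;

    public ProgressReporter(long intervalMs, long nowMs) {
        this.intervalMs = intervalMs;
        this.lastReportMs = nowMs;
    }

    /** Returns true if a heartbeat should be sent at time nowMs. */
    public boolean shouldReport(long nowMs) {
        if (nowMs - lastReportMs >= intervalMs) {
            lastReportMs = nowMs;
            return true;
        }
        return false;
    }
}
```

The split loop would call shouldReport(System.currentTimeMillis()) after every edit and every file open, instead of counting to 1024 first.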
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173705#comment-13173705 ] Hadoop QA commented on HBASE-5077: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508167/HBASE-5077-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.io.TestHeapSize Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/560//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/560//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/560//console This message is automatically generated. 
SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.0 Attachments: HBASE-5077-v2.patch, HBASE-5077.patch I hope I didn't break spacetime continuum, I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. 
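The fix direction for the stack trace above can be sketched in miniature: treat a missing task znode in endTask as "task was preempted and cleaned up", not as a fatal logic error. This is an illustrative toy with a Map standing in for ZooKeeper; none of the names come from the actual HBASE-5077 patch.

```java
// Toy model of an endTask that tolerates a NoNode condition: a missing
// task entry means another worker/master already handled it, so this
// worker lets go quietly instead of aborting the region server.
import java.util.HashMap;
import java.util.Map;

class EndTaskSketch {
    private final Map<String, String> taskZnodes = new HashMap<>(); // path -> owner

    void createTask(String path, String owner) { taskZnodes.put(path, owner); }

    /** Returns true only if this worker marked its own task done. */
    boolean endTask(String path, String worker) {
        String owner = taskZnodes.get(path);
        if (owner == null) {
            return false; // znode gone: preempted and cleaned up elsewhere
        }
        if (!owner.equals(worker)) {
            return false; // reassigned to another worker; don't touch it
        }
        taskZnodes.remove(path); // mark done
        return true;
    }
}
```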
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173709#comment-13173709 ] Zhihong Yu commented on HBASE-5021: --- TestHeapSize of HBase-TRUNK #2564 on Jenkins failed. Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: D849.1.patch, D849.2.patch, D849.3.patch, HBASE-5021-trunk.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted. We are working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Since timestamps are a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity-check the upper bound on the server side in HBase.
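The server-side sanity check proposed above reduces to a one-line bound. A minimal sketch, assuming a configurable clock-skew allowance; the name maxClockSkewMs is invented, not taken from the attached patches.

```java
// Hypothetical upper-bound check: reject an edit whose timestamp lies
// further in the future than the server clock plus an allowed skew.
class TimestampGuard {
    private final long maxClockSkewMs;

    TimestampGuard(long maxClockSkewMs) { this.maxClockSkewMs = maxClockSkewMs; }

    /** True if ts is no later than now plus the allowed skew. */
    boolean isAcceptable(long ts, long nowMs) {
        return ts <= nowMs + maxClockSkewMs;
    }
}
```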
[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan
[ https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173716#comment-13173716 ] Zhihong Yu commented on HBASE-5035: --- I checked the code again and think regionInfo is unlikely to be null. Remaining possibilities are regionInfo.getTableDesc() and somewhere in the HSA ctor: {code} public HServerAddress(String hostAndPort) { {code} Making the trace more helpful should be the first action. Runtime exceptions during meta scan --- Key: HBASE-5035 URL: https://issues.apache.org/jira/browse/HBASE-5035 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.3 Reporter: Shrijeet Paliwal Version: 0.90.3 + patches backported The other day our client started spitting these two runtime exceptions. Not all clients connected to the cluster were affected; only 4 of them were. While 3 of them were throwing NPE, one of them was throwing ArrayIndexOutOfBoundsException. The errors are: 1. http://pastie.org/2987926 2. http://pastie.org/2987927 Clients did not recover from this and I had to restart them. The motive of this jira is to identify and put null checks at appropriate places. Also, with the given stack trace I cannot tell which line caused the NPE or AIOBE, hence the additional motive is to make the trace more helpful.
[jira] [Commented] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further
[ https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173730#comment-13173730 ] ramkrishna.s.vasudevan commented on HBASE-5009: --- @Ted {code} boolean stillRunning = !threadPool.awaitTermination( this.fileSplitTimeout, TimeUnit.MILLISECONDS); {code} We already do that. We wait for 30 secs. But the problem is that the threads that were spawned are still alive and later they issue the create command to the NN. Thus after we roll back and delete the splitdir, it is created again by these threads. Correct me if I am wrong, Ted. Failure of creating split dir if it already exists prevents splits from happening further - Key: HBASE-5009 URL: https://issues.apache.org/jira/browse/HBASE-5009 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Attachments: HBASE-5009.patch, HBASE-5009_Branch90.patch The scenario is: - The split of a region takes a long time - The deletion of the splitDir fails due to HDFS problems. - Subsequent splits also fail after that. {code} private static void createSplitDir(final FileSystem fs, final Path splitdir) throws IOException { if (fs.exists(splitdir)) throw new IOException("Splitdir already exits? " + splitdir); if (!fs.mkdirs(splitdir)) throw new IOException("Failed create of " + splitdir); } {code} Correct me if I am wrong. If it is an issue, can we change the behaviour of throwing the exception? Pls suggest.
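One possible behaviour change, sketched under the assumption that deleting a leftover split directory is safe after a rollback. java.nio.file stands in for the HDFS FileSystem API here; this is not the attached patch.

```java
// Sketch of the alternative discussed above: if a leftover split directory
// exists from a failed/rolled-back attempt, remove it recursively and
// recreate it, instead of failing every subsequent split with an IOException.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class SplitDirSketch {
    static void createSplitDir(Path splitdir) throws IOException {
        if (Files.exists(splitdir)) {
            // leftover from a rolled-back split: delete children before parent
            try (Stream<Path> walk = Files.walk(splitdir)) {
                walk.sorted(Comparator.reverseOrder())
                    .forEach(p -> p.toFile().delete());
            }
        }
        Files.createDirectories(splitdir);
    }
}
```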
[jira] [Commented] (HBASE-5073) Registered listeners not getting removed leading to memory leak in HBaseAdmin
[ https://issues.apache.org/jira/browse/HBASE-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173734#comment-13173734 ] ramkrishna.s.vasudevan commented on HBASE-5073: --- Tests are passing. Registered listeners not getting removed leading to memory leak in HBaseAdmin - Key: HBASE-5073 URL: https://issues.apache.org/jira/browse/HBASE-5073 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.90.5 Attachments: HBASE-5073.patch HBaseAdmin APIs like tableExists(), flush, split and closeRegion use the catalog tracker. Every time, the root node tracker and meta node tracker are started and a listener is registered, but after the operations are performed the listeners are not removed. Hence if the admin APIs are used consistently, it may lead to a memory leak.
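The leak pattern described above is the classic register-without-unregister; the shape of the fix is a try/finally around each admin operation. A generic sketch with invented names, not the attached HBASE-5073.patch:

```java
// Sketch: scope a tracker listener to a single admin call and always
// unregister it in finally, so repeated calls don't accumulate listeners.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class ListenerScopeSketch {
    final List<Object> listeners = new ArrayList<>(); // stand-in for ZK listeners

    <T> T withCatalogTracker(Object listener, Supplier<T> op) {
        listeners.add(listener);        // start trackers, register listener
        try {
            return op.get();            // e.g. tableExists / flush / split
        } finally {
            listeners.remove(listener); // the cleanup the bug report says is missing
        }
    }
}
```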
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173746#comment-13173746 ] Phabricator commented on HBASE-5033: Liyin has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. Just discussed with Kannan offline; he found out that it wouldn't achieve the maximum parallelization across regions from different tables, considering each table may have a different number of stores. One proposal is to just bound the total number of threads for this parallelization process for each region, so that the following formula holds: the max number of threads for opening stores * the max number of threads for opening store files = the total number of threads we have configured. So if one region has fewer stores, this region will have more threads to open/close the store files in parallel, and vice versa. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers open/close each store, and each store file for every store, in a sequential fashion, which can make region open/close inefficient. So this diff opens/closes each store in parallel in order to reduce region open/close time. It would also help to reduce cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel.
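Kannan's formula above can be made concrete with a little integer arithmetic. A sketch under the assumption that the per-store-file thread count is derived by dividing the configured total; the method names are invented:

```java
// Thread-budget sketch for parallel store opening: storeThreads multiplied
// by storeFileThreadsPerStore stays within the configured total, so a region
// with fewer stores gets more threads per store's files, and vice versa.
class ParallelOpenBudget {
    static int storeThreads(int totalThreads, int storesInRegion) {
        return Math.max(1, Math.min(storesInRegion, totalThreads));
    }

    static int storeFileThreadsPerStore(int totalThreads, int storesInRegion) {
        return Math.max(1, totalThreads / storeThreads(totalThreads, storesInRegion));
    }
}
```

For a total budget of 8 threads, a region with 2 stores gets 4 threads per store's files, while a region with 16 stores gets 1.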
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173764#comment-13173764 ] Hadoop QA commented on HBASE-4218: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508181/D447.9.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 65 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/561//console This message is automatically generated. Delta Encoding of KeyValues (aka prefix compression) - Key: HBASE-4218 URL: https://issues.apache.org/jira/browse/HBASE-4218 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jacek Migdal Assignee: Mikhail Bautin Labels: compression Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, D447.1.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, Delta_encoding_with_memstore_TS.patch, open-source.diff A compression for keys. Keys are sorted in HFile and they are usually very similar. Because of that, it is possible to design better compression than general-purpose algorithms. It is an additional step designed to be used in memory. It aims to save memory in cache as well as to speed up seeks within HFileBlocks. It should improve performance a lot if key lengths are larger than value lengths. For example, it makes a lot of sense to use it when the value is a counter. Initial tests on real data (key length = ~90 bytes, value length = 8 bytes) show that I could achieve a decent level of compression: key compression ratio: 92% total compression ratio: 85% LZO on the same data: 85% LZO after delta encoding: 91% while having much better performance (20-80% faster decompression than LZO).
Moreover, it should allow far more efficient seeking, which should improve performance a bit. It seems that simple compression algorithms are good enough. Most of the savings are due to prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields). In order to implement it in HBase, two important changes in design will be needed: -solidify the interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to the uncompressed buffer in HFileBlock will have bad performance -extend comparators to support comparison assuming that the first N bytes are equal (or some fields are equal) Link to a discussion about something similar: http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windowssubj=Re+prefix+compression
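The core saving the description relies on is easy to demonstrate. A toy prefix encoder for sorted keys follows; it has nothing to do with the actual D447 encoders, and the prefixLen|suffix output format is invented for illustration:

```java
// Toy illustration of prefix compression: for sorted keys, store only the
// length of the prefix shared with the previous key plus the new suffix.
// Not the HBase encoder, just the redundancy it exploits.
import java.util.ArrayList;
import java.util.List;

class PrefixEncodeSketch {
    static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String k : sortedKeys) {
            int common = 0;
            int max = Math.min(prev.length(), k.length());
            while (common < max && prev.charAt(common) == k.charAt(common)) common++;
            out.add(common + "|" + k.substring(common)); // prefixLen|suffix
            prev = k;
        }
        return out;
    }
}
```

For keys like row001, row002 the second entry shrinks to 5|2, which is where the ~92% key compression on long, similar keys comes from.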
[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan
[ https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173780#comment-13173780 ] Shrijeet Paliwal commented on HBASE-5035: - Hmm, you might be right. {noformat} final String serverAddress = Bytes.toString(value); // instantiate the location HRegionLocation loc = new HRegionLocation(regionInfo, new HServerAddress(serverAddress)); {noformat} The Bytes.toString call, in theory, may return either an empty string or null. In the case when it returns null (see below), it tries to log an error, which I didn't see in my log file. So I am still not 100% sure this is our guy. {noformat} try { return new String(b, off, len, HConstants.UTF8_ENCODING); } catch (UnsupportedEncodingException e) { LOG.error("UTF-8 not supported?", e); return null; } {noformat} Nonetheless, it would be good to put a check against the serverAddress variable for emptiness as well as nullness, since the HServerAddress constructor may throw a runtime error otherwise. The interesting point is that it can throw both ArrayIndexOutOfBoundsException and NPE, and I saw both cases. {noformat} /** * @param hostAndPort Hostname and port formatted as <code>&lt;hostname&gt; ':' &lt;port&gt;</code> */ public HServerAddress(String hostAndPort) { int colonIndex = hostAndPort.lastIndexOf(':'); {noformat} I will open a subtask to make the trace more helpful.
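The defensive check proposed in the comment above can be sketched directly: validate the string before the host:port split, turning a bare NPE or AIOBE into an error that names the bad input. The class and method names here are invented, not the eventual HBase fix:

```java
// Sketch of defensive host:port parsing: a null input causes NPE and a
// missing colon causes AIOOBE downstream, so reject both up front with a
// message that includes the offending value.
class HostPortParser {
    /** Returns {host, port}; throws IllegalArgumentException naming the input. */
    static String[] parse(String hostAndPort) {
        if (hostAndPort == null || hostAndPort.isEmpty()) {
            throw new IllegalArgumentException("empty server address: " + hostAndPort);
        }
        int colon = hostAndPort.lastIndexOf(':');
        if (colon <= 0 || colon == hostAndPort.length() - 1) {
            throw new IllegalArgumentException("malformed server address: " + hostAndPort);
        }
        return new String[] { hostAndPort.substring(0, colon),
                              hostAndPort.substring(colon + 1) };
    }
}
```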
[jira] [Created] (HBASE-5080) Make HConnectionManager's prefetchRegionCache error trace more helpful
Make HConnectionManager's prefetchRegionCache error trace more helpful -- Key: HBASE-5080 URL: https://issues.apache.org/jira/browse/HBASE-5080 Project: HBase Issue Type: Sub-task Components: client Affects Versions: 0.90.3 Reporter: Shrijeet Paliwal Priority: Minor We catch RuntimeException in HConnectionManager's prefetchRegionCache method. The trace is not very helpful since it eats the information about the line where the RuntimeException actually happened.
[jira] [Updated] (HBASE-4120) isolation and allocation
[ https://issues.apache.org/jira/browse/HBASE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Jia updated HBASE-4120: --- Attachment: TablePriority_v17.patch Reduce the time used by TestPriorityJobQueue.java. isolation and allocation Key: HBASE-4120 URL: https://issues.apache.org/jira/browse/HBASE-4120 Project: HBase Issue Type: New Feature Components: master, regionserver Affects Versions: 0.90.2, 0.90.3, 0.90.4, 0.92.0 Reporter: Liu Jia Assignee: Liu Jia Fix For: 0.94.0 Attachments: Design_document_for_HBase_isolation_and_allocation.pdf, Design_document_for_HBase_isolation_and_allocation_Revised.pdf, HBase_isolation_and_allocation_user_guide.pdf, Performance_of_Table_priority.pdf, Simple_YCSB_Tests_For_TablePriority_Trunk_and_0.90.4.pdf, System Structure.jpg, TablePriority.patch, TablePriority_v12.patch, TablePriority_v12.patch, TablePriority_v15_with_coprocessor.patch, TablePriority_v16_with_coprocessor.patch, TablePriority_v17.patch, TablePriority_v17.patch, TablePriority_v8.patch, TablePriority_v8.patch, TablePriority_v8_for_trunk.patch, TablePrioriy_v9.patch The HBase isolation and allocation tool is designed to help users manage cluster resources among different applications and tables. When we have a large-scale HBase cluster with many applications running on it, there will be lots of problems. In Taobao there is a cluster for many departments to test their applications' performance; these applications are based on HBase. With one cluster of 12 servers, only one application could run exclusively on it, and many other applications had to wait until the previous test finished. After we added the allocation management function to the cluster, applications can share the cluster and run concurrently. Also, if the test engineer wants to make sure there is no interference, he/she can move other tables out of this group.
In groups we use table priority to allocate resources: when the system is busy, we can make sure high-priority tables are not affected by lower-priority tables. Different groups can have different region server configurations; some groups optimized for reading can have a large block cache size, and others optimized for writing can have a large memstore size. Tables and region servers can be moved easily between groups; after changing the configuration, a group can be restarted alone instead of restarting the whole cluster. Git entry: https://github.com/ICT-Ope/HBase_allocation. We hope our work is helpful.
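As a rough illustration of the table-priority idea described above (invented names; not code from TablePriority_v17.patch), a priority queue that serves higher-priority tables' jobs first when the system is busy:

```java
// Toy priority scheduler: jobs carry their table's priority, and under
// load the highest-priority table's jobs are dequeued first.
import java.util.PriorityQueue;

class TablePrioritySketch {
    static final class Job {
        final String table;
        final int priority;
        Job(String table, int priority) { this.table = table; this.priority = priority; }
    }

    private final PriorityQueue<Job> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));

    void submit(String table, int priority) { queue.add(new Job(table, priority)); }
    Job next() { return queue.poll(); } // highest-priority job, or null if idle
}
```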
[jira] [Commented] (HBASE-4120) isolation and allocation
[ https://issues.apache.org/jira/browse/HBASE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173785#comment-13173785 ] jirapos...@reviews.apache.org commented on HBASE-4120: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1421/ --- (Updated 2011-12-21 02:32:29.569594) Review request for hbase. Changes --- Reduce the time used by TestPriorityJobQueue.java Summary --- Patch used for table priority alone. In this patch, not only can tables have different priorities, but different actions like get, scan, put and delete can also have priorities. This addresses bug HBase-4120. https://issues.apache.org/jira/browse/HBase-4120 Diffs (updated) - http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java 1220359 http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java 1220359 http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/PriorityFunction.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/PriorityHBaseServer.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/PriorityJobQueue.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/QosRegionObserver.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/coprocessor/TestMasterCoprocessorExceptionWithAbort.java 1220359 http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/coprocessor/TestMasterCoprocessorExceptionWithRemove.java 1220359 http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/ipc/GroupTestUtil.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/ipc/TestActionPriority.java PRE-CREATION
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/ipc/TestPriorityJobQueue.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/ipc/TestTablePriority.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/ipc/TestTablePriorityHundredRegion.java PRE-CREATION http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/ipc/TestTablePriorityLargeRow.java PRE-CREATION Diff: https://reviews.apache.org/r/1421/diff Testing --- Tested with test cases in TestCase_For_TablePriority_trunk_v1.patch. Please apply the patch of HBASE-4181 first; in some circumstances this bug will affect the performance of the client. Thanks, Jia
[jira] [Updated] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-4218: --- Attachment: D447.10.patch mbautin updated the revision [jira] [HBASE-4218] HFile data block encoding (delta encoding). Reviewers: JIRA, tedyu, stack, nspiegelberg, Kannan Fixing fully-qualified class name in admin.rb. All unit tests passed, except TestReplication.queueFailover, which is known to be flaky. REVISION DETAIL https://reviews.facebook.net/D447 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java src/main/java/org/apache/hadoop/hbase/KeyValue.java src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java src/main/java/org/apache/hadoop/hbase/io/encoding/BitsetKeyDeltaEncoder.java src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncodingAlgorithms.java src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java 
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java src/main/java/org/apache/hadoop/hbase/regionserver/Store.java src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java src/main/ruby/hbase/admin.rb src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java src/test/java/org/apache/hadoop/hbase/io/TestHeapSize.java src/test/java/org/apache/hadoop/hbase/io/encoding/RedundantKVGenerator.java src/test/java/org/apache/hadoop/hbase/io/encoding/TestBufferedDataBlockEncoder.java src/test/java/org/apache/hadoop/hbase/io/encoding/TestDataBlockEncoders.java src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java 
src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportExport.java src/test/java/org/apache/hadoop/hbase/regionserver/CreateRandomStoreFile.java src/test/java/org/apache/hadoop/hbase/regionserver/DataBlockEncodingTool.java src/test/java/org/apache/hadoop/hbase/regionserver/EncodedSeekPerformanceTest.java src/test/java/org/apache/hadoop/hbase/regionserver/TestCompactSelection.java src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java src/test/java/org/apache/hadoop/hbase/regionserver/TestCompoundBloomFilter.java src/test/java/org/apache/hadoop/hbase/regionserver/TestFSErrorsExposed.java src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
[jira] [Commented] (HBASE-4120) isolation and allocation
[ https://issues.apache.org/jira/browse/HBASE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173789#comment-13173789 ] Hadoop QA commented on HBASE-4120: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508188/TablePriority_v17.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 24 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -137 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 86 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.io.TestHeapSize Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/562//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/562//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/562//console This message is automatically generated. 
isolation and allocation Key: HBASE-4120 URL: https://issues.apache.org/jira/browse/HBASE-4120 Project: HBase Issue Type: New Feature Components: master, regionserver Affects Versions: 0.90.2, 0.90.3, 0.90.4, 0.92.0 Reporter: Liu Jia Assignee: Liu Jia Fix For: 0.94.0 Attachments: Design_document_for_HBase_isolation_and_allocation.pdf, Design_document_for_HBase_isolation_and_allocation_Revised.pdf, HBase_isolation_and_allocation_user_guide.pdf, Performance_of_Table_priority.pdf, Simple_YCSB_Tests_For_TablePriority_Trunk_and_0.90.4.pdf, System Structure.jpg, TablePriority.patch, TablePriority_v12.patch, TablePriority_v12.patch, TablePriority_v15_with_coprocessor.patch, TablePriority_v16_with_coprocessor.patch, TablePriority_v17.patch, TablePriority_v17.patch, TablePriority_v8.patch, TablePriority_v8.patch, TablePriority_v8_for_trunk.patch, TablePrioriy_v9.patch The HBase isolation and allocation tool is designed to help users manage cluster resources among different applications and tables. When we have a large-scale HBase cluster with many applications running on it, there will be lots of problems. At Taobao there is a cluster where many departments test the performance of their HBase-based applications. With one cluster of 12 servers, only one application could run exclusively on it at a time, and many other applications had to wait until the previous test finished. After we added the allocation management function to the cluster, applications can share the cluster and run concurrently. Also, if a test engineer wants to make sure there is no interference, he/she can move other tables out of the group. 
Within a group we use table priority to allocate resources; when the system is busy, we can make sure high-priority tables are not affected by lower-priority tables. Different groups can have different region server configurations: groups optimized for reading can have a large block cache, while groups optimized for writing can have a large memstore. Tables and region servers can be moved easily between groups, and after changing its configuration, a group can be restarted alone instead of restarting the whole cluster. git entry: https://github.com/ICT-Ope/HBase_allocation . We hope our work is helpful. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
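As a sketch of the per-group tuning described above: a read-optimized group might raise the block cache share while a write-optimized group raises the memstore share. The property names below are the standard hbase-site.xml keys of that era; the values are hypothetical examples, not recommendations from the patch.

```xml
<!-- Read-optimized group: give more heap to the block cache (hypothetical value) -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
<!-- Write-optimized group: give more heap to memstores (hypothetical value) -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.45</value>
</property>
```

Because these are region server settings, they can only differ per group if each group runs its own region server configuration, which is exactly what the proposal describes.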
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173792#comment-13173792 ] Hadoop QA commented on HBASE-4218: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508190/D447.10.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 65 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/563//console This message is automatically generated. Delta Encoding of KeyValues (aka prefix compression) - Key: HBASE-4218 URL: https://issues.apache.org/jira/browse/HBASE-4218 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jacek Migdal Assignee: Mikhail Bautin Labels: compression Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, D447.1.patch, D447.10.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, Delta_encoding_with_memstore_TS.patch, open-source.diff A compression scheme for keys. Keys are sorted in an HFile and are usually very similar. Because of that, it is possible to design better compression than general-purpose algorithms provide. It is an additional step designed to be used in memory. It aims to save memory in the cache as well as speed up seeks within HFileBlocks. It should improve performance a lot if key lengths are larger than value lengths; for example, it makes a lot of sense to use it when the value is a counter. 
Initial tests on real data (key length ~ 90 bytes, value length = 8 bytes) show that I could achieve a decent level of compression: key compression ratio: 92% total compression ratio: 85% LZO on the same data: 85% LZO after delta encoding: 91% while having much better performance (20-80% faster decompression than LZO). Moreover, it should allow far more efficient seeking, which should improve performance a bit. It seems that simple compression algorithms are good enough. Most of the savings are due to prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields). In order to implement it in HBase, two important changes in design will be needed: -solidify the interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to the uncompressed buffer in HFileBlock will have bad performance -extend comparators to support comparison assuming that the first N bytes are equal (or some fields are equal) Link to a discussion about something similar: http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windowssubj=Re+prefix+compression
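The prefix-compression idea behind the savings above can be sketched with a toy encoder. This is illustrative only: the class and method names are invented, and the patch's actual encoded format (varint lengths, timestamp diffs, bitfields) is richer than this.

```java
import java.util.ArrayList;
import java.util.List;

// Toy prefix encoder: each key is stored as (sharedPrefixLen, suffix)
// relative to the previous key. Because HFile keys are sorted and very
// similar, consecutive keys share long prefixes and the suffixes are short.
public class PrefixEncoder {
    public static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int common = 0;
            int max = Math.min(prev.length(), key.length());
            while (common < max && prev.charAt(common) == key.charAt(common)) {
                common++;
            }
            // Encoded entry: how many leading characters to copy from the
            // previous key, plus the differing suffix.
            out.add(common + ":" + key.substring(common));
            prev = key;
        }
        return out;
    }
}
```

Note how this also enables the faster comparisons mentioned above: if two encoded keys share the same predecessor, comparing them can start at the first byte past the shared prefix.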
[jira] [Created] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Recently, during 0.92 RC testing, we found distributed log splitting hangs there forever. Please see the attached screenshot. I looked into it and here is what I think happened: 1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. ServerShutdownHandler retried the log splitting; 4. During the retry, it created these three tasks again and put them in a hashmap (tasks); 5. The asynchronous deletion from step 2 finally happened for one task; in the callback, it removed that task from the hashmap; 6. One of the newly submitted tasks' zookeeper watchers found that the task was unassigned and not in the hashmap, so it created a new orphan task; 7. All three tasks failed, but the task created in step 6 is an orphan, so the batch.err counter was one short, and the log splitting hangs there, waiting forever for the last task to finish, which is never going to happen. So I think the problem is step 2. The fix is to make the deletion synchronous instead of asynchronous, so that the retry has a clean start. An async deleteNode will interfere with the split log retry; in an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be synchronous.
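The fix direction, delete synchronously so the retry starts from a clean slate, can be modeled with a minimal sketch. The names here are invented; this stands in for SplitLogManager's task map and is not the actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the race: retrySplit() must only repopulate the task
// map after the old task nodes are fully deleted. With a synchronous
// delete, the late-firing callback of step 5 above cannot remove a task
// that the retry just re-created.
public class SplitRetryModel {
    private final Map<String, String> tasks = new HashMap<>();

    // Synchronous delete: returns only once the node (and its map entry)
    // is gone, so no deletion callback is left pending.
    private void deleteNodeSync(String path) {
        tasks.remove(path);
    }

    public void retrySplit(String[] logPaths) {
        for (String p : logPaths) {
            deleteNodeSync(p);          // clean start: old task gone *before* resubmit
        }
        for (String p : logPaths) {
            tasks.put(p, "UNASSIGNED"); // resubmitted task cannot be clobbered later
        }
    }

    public int pendingTasks() {
        return tasks.size();
    }
}
```

With the asynchronous variant, the delete of the old node could complete after the put, silently shrinking the map and producing the orphan-task undercount described in step 7.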
[jira] [Updated] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5021: - Attachment: 5021-addendum.txt Small addendum to fix the broken TestHeapSize Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: 5021-addendum.txt, D849.1.patch, D849.2.patch, D849.3.patch, HBASE-5021-trunk.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted by clients. We are working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Since timestamps are considered a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity-check the upper bound on the server side in HBase.
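The kind of server-side sanity check being proposed might look like the sketch below. The guard class and the clock-skew parameter are invented for illustration; they are not the API the attached patches add.

```java
// Reject writes whose timestamp is implausibly far in the future.
// maxClockSkewMs is a hypothetical tolerance for client/server clock drift.
public class TimestampGuard {
    private final long maxClockSkewMs;

    public TimestampGuard(long maxClockSkewMs) {
        this.maxClockSkewMs = maxClockSkewMs;
    }

    // Returns true if the cell timestamp is within the allowed upper bound
    // relative to the server's current time (passed in for testability).
    public boolean isAcceptable(long cellTimestampMs, long serverNowMs) {
        return cellTimestampMs <= serverNowMs + maxClockSkewMs;
    }
}
```

A region server enforcing such a bound would reject the bad write at insert time, instead of letting an invalid far-future timestamp poison time-range optimizations until a costly cleanup.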
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173828#comment-13173828 ] stack commented on HBASE-5021: -- Committed to TRUNK
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Attachment: distributed-log-splitting-screenshot.png
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173832#comment-13173832 ] stack commented on HBASE-5077: -- +1 on patch for branch and trunk. SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.0 Attachments: HBASE-5077-v2.patch, HBASE-5077.patch I hope I didn't break spacetime continuum, I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a 
moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves.
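One way a worker can avoid dying here is to treat a missing task node as "already reassigned" rather than a fatal logic error. The sketch below is a stand-in using a plain map instead of ZooKeeper; the names are hypothetical and the real fix is in the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the task store: endTask() on a node that no longer exists
// (because the master reassigned the task) is treated as a benign outcome
// instead of a FATAL error that kills the region server.
public class TaskEnder {
    private final Map<String, String> znodes = new HashMap<>();

    public void createTask(String path) {
        znodes.put(path, "OWNED");
    }

    public void preemptTask(String path) {
        znodes.remove(path);  // master gave the task to another worker
    }

    // Returns true if we marked the task done, false if it was already gone.
    public boolean endTask(String path) {
        if (!znodes.containsKey(path)) {
            return false;     // node vanished: log and move on, don't abort
        }
        znodes.put(path, "DONE");
        return true;
    }
}
```

In the real code the equivalent situation shows up as a NoNodeException from the ZooKeeper setData call; converting that into a non-fatal "task already gone" result matches how the other region servers in this incident interrupted themselves cleanly.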
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173836#comment-13173836 ] Hadoop QA commented on HBASE-5078: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508195/5078-v4.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/564//console This message is automatically generated. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078-v3.txt, 5078-v4.txt, 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions, and just opening the file per region was taking so long that we were never updating our progress, so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take to acquire the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 
2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} 
When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits.
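The underlying problem is that progress is reported every fixed number of edits rather than on a time interval, so slow per-edit work (like opening 35 region files) can outlast the 25-second acquisition timeout before 1024 edits are reached. A hedged sketch of a time-based alternative follows; the helper name and interval are invented for illustration, not taken from the committed fix.

```java
// Decide whether to heartbeat based on elapsed wall-clock time, not on a
// fixed edit count, so slow per-edit work (like opening many region files)
// still reports progress before the task-acquisition timeout expires.
public class HeartbeatPolicy {
    private final long intervalMs;

    public HeartbeatPolicy(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // nowMs is passed in so callers (and tests) control the clock; the
    // split loop would call this once per edit and heartbeat when true.
    public boolean shouldHeartbeat(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs >= intervalMs;
    }
}
```

With an interval well under the hard-coded 25-second timeout, the worker in the logs above would have heartbeated during the long file-opening stretch instead of being preempted after only 40 edits.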
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Attachment: 5078-v4.txt Changed variable name as per Ted suggestion. This is what I'm committing.
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed branch and trunk. Thanks for review lads.
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5077: - Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed branch and trunk.
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5077: - Attachment: HBASE-5077-v4.txt This patch actually compiles.
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173850#comment-13173850 ] stack commented on HBASE-5081: -- Or we need a means of uniquely identifying the failed split in the hashmap so when the callback runs, it only removes the pertinent tasks if present; i.e. the filename alone is not enough? Otherwise, sounds good Jimmy. Good find. Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png Recently, during 0.92 RC testing, we found distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again and put them in a hashmap (tasks);
5. The asynchronous deletion from step 2 finally happened for one task; in the callback, it removed that task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned and not in the hashmap, so it created a new orphan task;
7. All three tasks failed, but the task created in step 6 is an orphan, so the batch.err counter was one short, and the log splitting hangs there, waiting forever for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will interfere with the split log retry. 
In an extreme situation, if the async deleteNode doesn't happen soon enough, a node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
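The sync-versus-async distinction in the report can be sketched in isolation. This is a toy model with hypothetical names (`tasks`, `lateAsyncDeleteCallback`); it is not the actual SplitLogManager code, just an illustration of why a late delete callback wipes the retried task while a synchronous delete cannot:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the HBASE-5081 race. The hashmap stands in for the `tasks`
// map keyed by task path; all names here are illustrative.
public class DeleteNodeRace {
    public final Map<String, String> tasks = new ConcurrentHashMap<>();

    // Async flavour: the ZK delete callback can fire AFTER the retry has
    // re-created the task, removing the fresh entry (step 5 in the report).
    public void lateAsyncDeleteCallback(String path) {
        tasks.remove(path);
    }

    // Sync flavour: the delete completes before the retry re-creates the
    // task, so the retry starts from a clean map and nothing can race.
    public void syncDeleteThenRetry(String path) {
        tasks.remove(path);           // synchronous delete finishes here
        tasks.put(path, "retried");   // retry safely re-creates the task
    }
}
```

With the synchronous ordering, the retried task survives; the late async callback models how the orphan-task condition arose.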
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173857#comment-13173857 ] stack commented on HBASE-4720: -- @Mubarak That test fails for me too on a mac (I presume you are on a mac -- smile). Need to dig in. Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173861#comment-13173861 ] stack commented on HBASE-4720: -- On patch: + Suggest you not do stuff like below in future because it bloats your patch, making it more susceptible to rot, and besides it is not related directly to what you are trying to fix, so it is distracting for reviewers (for the next time): {code} - private static final HBaseRESTTestingUtility REST_TEST_UTIL = -new HBaseRESTTestingUtility(); + private static final HBaseRESTTestingUtility REST_TEST_UTIL = new HBaseRESTTestingUtility(); {code} + Is this safe? e.g. what if the row has binary characters in it? Should these be base64'd or something? {code} +path.append('/'); +path.append(checkandput); +path.append('/'); +path.append(table); +path.append('/'); +path.append(row); +path.append('/'); +path.append(column); {code} I suppose it's not needed in a test (I missed that this is test code). You have some lines that are way too long. Else patch looks good to me on cursory review. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
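stack's question about binary row keys in the URL can be answered with URL-safe base64, which guarantees no '/' or '+' lands in the path. A minimal sketch; the `checkAndPutPath` layout below is hypothetical, not the actual Stargate route:

```java
import java.util.Base64;

// Sketch of base64-encoding a possibly-binary row key before embedding it
// in a REST path. URL-safe base64 keeps '/' and '+' out of the path segment.
// The path layout is illustrative only.
public class RowKeyPath {
    public static String encodeRow(byte[] row) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(row);
    }

    public static String checkAndPutPath(String table, byte[] row, String column) {
        return "/" + table + "/" + encodeRow(row) + "/" + column + "/checkandput";
    }
}
```

The encoding round-trips losslessly, so the server can recover the exact row bytes from the path segment.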
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173865#comment-13173865 ] Zhihong Yu commented on HBASE-4218: --- Thanks for the nice work, Mikhail. {code} 1 out of 1 hunk ignored -- saving rejects to file src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java.rej 1 out of 2 hunks FAILED -- saving rejects to file src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java.rej {code} Please fix the above conflicts by rebasing against TRUNK. Delta Encoding of KeyValues (aka prefix compression) - Key: HBASE-4218 URL: https://issues.apache.org/jira/browse/HBASE-4218 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jacek Migdal Assignee: Mikhail Bautin Labels: compression Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, D447.1.patch, D447.10.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, Delta_encoding_with_memstore_TS.patch, open-source.diff A compression for keys. Keys are sorted in HFile and they are usually very similar. Because of that, it is possible to design better compression than general purpose algorithms, It is an additional step designed to be used in memory. It aims to save memory in cache as well as speeding seeks within HFileBlocks. It should improve performance a lot, if key lengths are larger than value lengths. For example, it makes a lot of sense to use it when value is a counter. Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) shows that I could achieve decent level of compression: key compression ratio: 92% total compression ratio: 85% LZO on the same data: 85% LZO after delta encoding: 91% While having much better performance (20-80% faster decompression ratio than LZO). Moreover, it should allow far more efficient seeking which should improve performance a bit. 
It seems that simple compression algorithms are good enough. Most of the savings are due to prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields). In order to implement it in HBase, two important design changes will be needed:
- solidify the interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to the uncompressed buffer in HFileBlock will have bad performance
- extend comparators to support comparison assuming that the first N bytes are equal (or some fields are equal)
Link to a discussion about something similar: http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windowssubj=Re+prefix+compression -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
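The prefix-compression idea is easy to demonstrate on sorted keys: since each key shares a long prefix with its predecessor, store only (shared-prefix length, suffix). This toy works on Strings with a `prefixLen:suffix` wire format; the real patch operates on KeyValue byte buffers and adds the timestamp/bitfield tricks described above:

```java
import java.util.ArrayList;
import java.util.List;

// Toy prefix encoder over sorted String keys, illustrating the core idea
// of HBASE-4218 (not the actual delta-encoding wire format).
public class PrefixEncoder {
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Each key becomes "prefixLen:suffix" relative to the previous key.
    public static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String k : sortedKeys) {
            int p = commonPrefix(prev, k);
            out.add(p + ":" + k.substring(p));
            prev = k;
        }
        return out;
    }

    public static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int sep = e.indexOf(':');
            int p = Integer.parseInt(e.substring(0, sep));
            String k = prev.substring(0, p) + e.substring(sep + 1);
            out.add(k);
            prev = k;
        }
        return out;
    }
}
```

Note how the shared-prefix length also enables the faster comparator mentioned above: two encoded keys can be ordered by comparing suffixes once the common-prefix bookkeeping tells you the first N bytes are equal.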
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173867#comment-13173867 ] Zhihong Yu commented on HBASE-5081: --- To expedite the release of 0.92, I think we can make deleteNode synchronous first. Once distributed log splitting works robustly, we can implement async deleteNode in a follow-on JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173872#comment-13173872 ] Lars Hofhansl commented on HBASE-5078: -- Came here to +1... Too late :) Interesting that opening 35 files takes 25s. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078-v3.txt, 5078-v4.txt, 5078.txt Testing 0.92.0RC, ran into interesting issue where a log file had edits for many regions and just opening the file per region was taking so long, we were never updating our progress and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds its supposed to take acquiring the task. First, here is master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. 
Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
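The failure mode here (25 seconds elapsing long before 1024 edits accumulate, because each edit may force a file open) suggests heartbeating on wall-clock time instead of edit count. A hedged sketch of that idea, with illustrative names rather than the actual HLogSplitter fix:

```java
// Toy time-based progress reporter: report whenever enough wall-clock time
// has passed, regardless of how many edits have been processed. The report
// itself stands in for heartbeating the split task znode.
public class ProgressReporter {
    private final long intervalMs;
    private long lastReport;
    private int reports = 0;

    public ProgressReporter(long intervalMs, long nowMs) {
        this.intervalMs = intervalMs;
        this.lastReport = nowMs;
    }

    // Call on every edit (or every file open); cheap when there is nothing
    // to do, but guarantees a heartbeat even when edits trickle in slowly.
    public void maybeReport(long nowMs) {
        if (nowMs - lastReport >= intervalMs) {
            lastReport = nowMs;
            reports++;   // stand-in for updating the task in ZooKeeper
        }
    }

    public int reportCount() { return reports; }
}
```

With a count-based trigger, 40 slow edits produce zero heartbeats in 25 seconds; with a time-based trigger the worker keeps the task alive no matter how slowly edits arrive.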
[jira] [Commented] (HBASE-4916) LoadTest MR Job
[ https://issues.apache.org/jira/browse/HBASE-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173874#comment-13173874 ] Phabricator commented on HBASE-4916: stack has commented on the revision HBASE-4916 [jira] LoadTest MR Job. I started to review code then gave up to look more at what is attached. This looks better than PE and more easily extended IMO. Good documentation (could do w/ a package-info with overview of what it is and simple howto run it). Nice it adds itself to Driver. I think we should commit it after some fixup/review. I think it should go into tools package rather than into loadtest package so it'd be tools.loadtest. Nice one. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/loadtest/CompositeOperationGenerator.java:1 Missing license. src/main/java/org/apache/hadoop/hbase/loadtest/CompositeOperationGenerator.java:11 This looks promising src/main/java/org/apache/hadoop/hbase/loadtest/CompositeOperationGenerator.java:78 isEmpty is cheaper than size() == 0 src/main/java/org/apache/hadoop/hbase/loadtest/CompositeOperationGenerator.java:90 These exceptions could come out here? src/main/java/org/apache/hadoop/hbase/loadtest/DataGenerator.java:36 Whats a column? A cf+qualifier written as cf:qualifier? Or is this just a cf? src/main/java/org/apache/hadoop/hbase/loadtest/DataGenerator.java:71 Only longs allowed as keys? Is key == row? src/main/java/org/apache/hadoop/hbase/loadtest/DataGenerator.java:75 This should be called constructPut rather than constructBulkPut src/main/java/org/apache/hadoop/hbase/loadtest/DataGenerator.java:90 long key 'as String'... src/main/java/org/apache/hadoop/hbase/loadtest/GetGenerator.java:109 Should be logged at least rather than printStackTrace'd? Should we let it out? src/main/java/org/apache/hadoop/hbase/loadtest/GetOperation.java:47 Logged rather than printed src/main/java/org/apache/hadoop/hbase/loadtest/KeyCounter.java:32 There is an upper bound? 
Don't we need an upper bound so mappers don't overlap? Or is that not wanted? src/main/java/org/apache/hadoop/hbase/loadtest/KeyCounter.java:72 Why assign here? Why not declare and allocate in the one go? Could make the data members final then. src/main/java/org/apache/hadoop/hbase/loadtest/KeyCounter.java:78 Should be seeded? REVISION DETAIL https://reviews.facebook.net/D741 LoadTest MR Job --- Key: HBASE-4916 URL: https://issues.apache.org/jira/browse/HBASE-4916 Project: HBase Issue Type: Sub-task Components: client, regionserver Reporter: Nicolas Spiegelberg Assignee: Christopher Gist Fix For: 0.94.0 Attachments: HBASE-4916.D741.1.patch Add a script to start a streaming map-reduce job where each map tasks runs an instance of the load tester for a partition of the key-space. Ensure that the load tester takes a parameter indicating the start key for write operations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173876#comment-13173876 ] stack commented on HBASE-5078: -- Yeah, in-series with other stuff going on... (it's too long!) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173879#comment-13173879 ] Hudson commented on HBASE-5077: --- Integrated in HBase-TRUNK #2565 (See [https://builds.apache.org/job/HBase-TRUNK/2565/]) HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java stack : Files : * /hbase/trunk/CHANGES.txt * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173880#comment-13173880 ] Hudson commented on HBASE-5021: --- Integrated in HBase-TRUNK #2565 (See [https://builds.apache.org/job/HBase-TRUNK/2565/]) HBASE-5021 Enforce upper bound on timestamp stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: 5021-addendum.txt, D849.1.patch, D849.2.patch, D849.3.patch, HBASE-5021-trunk.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted by the app server. We are working on adding proper checks to the app server, but production performance could be severely impacted with significant recovery time if something slips past. Since timestamps are considered a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity check the upper bound on the server-side in HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
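The server-side sanity check this issue describes amounts to rejecting timestamps beyond "now plus some allowed clock skew". A minimal sketch; the skew value and class name are illustrative, not the actual configurable bound the patch adds to HRegion:

```java
// Toy version of the HBASE-5021 check: reject puts whose timestamp lies
// too far in the future relative to the server clock. Names are illustrative.
public class TimestampGuard {
    private final long maxClockSkewMs;

    public TimestampGuard(long maxClockSkewMs) {
        this.maxClockSkewMs = maxClockSkewMs;
    }

    // Returns false for timestamps beyond now + allowed skew; such writes
    // would pollute time-ordered optimizations until a costly cleanup.
    public boolean isAcceptable(long ts, long nowMs) {
        return ts <= nowMs + maxClockSkewMs;
    }
}
```

Keeping the bound slack-tolerant (rather than `ts <= now`) avoids rejecting legitimate writes from clients whose clocks run slightly ahead of the server.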
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173881#comment-13173881 ] Hudson commented on HBASE-5078: --- Integrated in HBase-TRUNK #2565 (See [https://builds.apache.org/job/HBase-TRUNK/2565/]) HBASE-5078 DistributedLogSplitter failing to split file because it has edits for lots of regions HBASE-5078 DistributedLogSplitter failing to split file because it has edits for lots of regions stack : Files : * /hbase/trunk/CHANGES.txt stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4698) Let the HFile Pretty Printer print all the key values for a specific row.
[ https://issues.apache.org/jira/browse/HBASE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-4698: - Resolution: Fixed Fix Version/s: 0.94.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to TRUNK. Thanks for the patch Liyin. Let the HFile Pretty Printer print all the key values for a specific row. - Key: HBASE-4698 URL: https://issues.apache.org/jira/browse/HBASE-4698 Project: HBase Issue Type: New Feature Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.94.0 Attachments: D111.1.patch, D111.1.patch, D111.1.patch, D111.2.patch, D111.3.patch, D111.4.patch, HBASE-4689-trunk.patch When using the HFile Pretty Printer to debug HBase issues, it would be very nice to allow the Pretty Printer to seek to a specific row, and only print all the key values for this row. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173888#comment-13173888 ] Phabricator commented on HBASE-5033: lhofhansl has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. The last proposal makes more sense. The number of threads for opening store files within a number of stores opened in parallel would be hard to grok in a real cluster. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers open/close each store, and each store file for every store, in sequential fashion, which can make region open/close slow. So this diff opens/closes each store in parallel in order to reduce region open/close time. It would also help reduce cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel.
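The four parallelization steps listed above all follow the same pattern: submit each open/close as a task to a bounded pool instead of looping sequentially. A rough, hypothetical sketch of that pattern (the store names and the "open" work are simulated, not HBase's real store-opening code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the idea in HBASE-5033: open each store as a task
// on a fixed-size pool, then wait for all of them, propagating any failure.
class ParallelOpener {
    static List<String> openStores(List<String> storeNames, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : storeNames) {
                // In HBase this would open the store and its files; here we
                // just simulate the work with a string result.
                futures.add(pool.submit(() -> "opened-" + name));
            }
            List<String> opened = new ArrayList<>();
            for (Future<String> f : futures) {
                opened.add(f.get()); // rethrows if any store failed to open
            }
            return opened;
        } finally {
            pool.shutdown();
        }
    }
}
```

The point Lars raises in the comment is sizing: nesting a per-store-file pool inside a per-store pool multiplies thread counts, which is why a single flat pool like the one above is easier to reason about.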
[jira] [Commented] (HBASE-5070) Constraints implementation and javadoc changes
[ https://issues.apache.org/jira/browse/HBASE-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173890#comment-13173890 ] jirapos...@reviews.apache.org commented on HBASE-5070: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3273/#review4034 --- Ship it! This patch should go in. Good improvements. Doesn't address the bigger issues of Configuration inline w/ HTD -- have you tried it, I mean, IIRC, though you might have one custom config only, the whole Configuration will be output per Constraint? -- and adding support to shell. Those are in different issues? src/docbkx/book.xml https://reviews.apache.org/r/3273/#comment9139 Only thing missing is since 0.94 which is when constraints will show up. src/main/java/org/apache/hadoop/hbase/constraint/BaseConstraint.java https://reviews.apache.org/r/3273/#comment9140 Better - Michael On 2011-12-20 19:14:46, Jesse Yates wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3273/ bq. --- bq. bq. (Updated 2011-12-20 19:14:46) bq. bq. bq. Review request for hbase, Gary Helmling, Ted Yu, and Michael Stack. bq. bq. bq. Summary bq. --- bq. bq. Follow-up on changes to constraint as per stack's comments on HBASE-4605. bq. bq. bq. This addresses bug HBASE-5070. bq. https://issues.apache.org/jira/browse/HBASE-5070 bq. bq. bq. Diffs bq. - bq. bq.src/docbkx/book.xml bd3f881 bq.src/main/java/org/apache/hadoop/hbase/constraint/BaseConstraint.java 7ce6d45 bq.src/main/java/org/apache/hadoop/hbase/constraint/Constraint.java 2d8b4d7 bq.src/main/java/org/apache/hadoop/hbase/constraint/Constraints.java 7825466 bq.src/main/java/org/apache/hadoop/hbase/constraint/package-info.java 6145ed5 bq. src/test/java/org/apache/hadoop/hbase/constraint/CheckConfigurationConstraint.java c49098d bq. bq. Diff: https://reviews.apache.org/r/3273/diff bq. bq. bq. Testing bq. --- bq. bq. 
mvn clean test -P localTests -Dtest=*Constraint* - all tests pass. bq. bq. bq. Thanks, bq. bq. Jesse bq. bq. Constraints implementation and javadoc changes -- Key: HBASE-5070 URL: https://issues.apache.org/jira/browse/HBASE-5070 Project: HBase Issue Type: Task Reporter: Zhihong Yu This is a continuation of HBASE-4605. See Stack's comments: https://reviews.apache.org/r/2579/#review3980
[jira] [Updated] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5060: - Resolution: Fixed Status: Resolved (was: Patch Available) Was committed a few days ago. HBase client is blocked forever --- Key: HBASE-5060 URL: https://issues.apache.org/jira/browse/HBASE-5060 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Critical Fix For: 0.92.0, 0.90.6 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we use an invalid CatalogTracker in the tableExists function. Blocked stack: WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331) - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) In ZooKeeperNodeTracker, we don't throw the KeeperException to the higher level, so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and continue processing. 
[WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
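The bug report above argues that swallowing the KeeperException in start() leaves callers holding a half-initialized tracker. A tiny, hypothetical illustration of the fail-fast behavior it asks for (the class and its data-source interface are invented for the sketch, not HBase's actual API):

```java
// Hypothetical sketch: a tracker whose start() propagates the connection
// error instead of only logging it, so callers never proceed with a
// half-initialized tracker.
class NodeTracker {
    private boolean started = false;

    // Stand-in for the ZooKeeper read (e.g. getDataAndWatch on a znode).
    interface DataSource { byte[] read() throws Exception; }

    void start(DataSource zk) throws Exception {
        try {
            zk.read();
            started = true;
        } catch (Exception e) {
            // Rethrow rather than swallow: the caller must not treat this
            // tracker as usable after a ConnectionLoss.
            throw e;
        }
    }

    boolean isStarted() { return started; }
}
```

With this shape, a ConnectionLoss during start() surfaces immediately in HBaseAdmin.getCatalogTracker() instead of later as an indefinite wait in waitForMeta().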
[jira] [Commented] (HBASE-4099) Authentication for ThriftServer clients
[ https://issues.apache.org/jira/browse/HBASE-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173897#comment-13173897 ] stack commented on HBASE-4099: -- These have been committed? Authentication for ThriftServer clients --- Key: HBASE-4099 URL: https://issues.apache.org/jira/browse/HBASE-4099 Project: HBase Issue Type: Sub-task Components: security Reporter: Gary Helmling Attachments: HBASE-4099.patch The current implementation of HBase client authentication only works with the Java API. Alternate access gateways, like Thrift and REST, are left out and will not work. For the ThriftServer to be able to fully interoperate with the security implementation: # the ThriftServer should be able to log in from a keytab file with its own server principal on startup # thrift clients should be able to authenticate securely when connecting to the server # the ThriftServer should be able to act as a proxy for those clients so that the RPCs it issues will be correctly authorized as the original client identities There is already some support for step 3 in UserGroupInformation and related classes. For step #2, we really need to look at what thrift itself supports. At a bare minimum, we need to implement step #1. If we do this, even without steps 2 and 3, this would at least allow deployments to use a ThriftServer per application user, and have the server log in as that user on startup. Thrift clients may not be directly authenticated, but authorization checks for HBase could still be handled correctly this way.
[jira] [Commented] (HBASE-3523) Rewrite our client (client 2.0)
[ https://issues.apache.org/jira/browse/HBASE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173906#comment-13173906 ] stack commented on HBASE-3523: -- Breathing some life back into this issue: Reasons for new client, updates: + The Jonathan Payne accounting of unaccounted off-heap socket buffers per thread, which makes our client OOME when there are lots of threads (HBASE-4956) + complex, lots of layers, long-lived zk connection (not necessary on client?) + Should work against multiple versions of hbase (but that might be another issue, an rpc issue... this issue could be distinct from rpc fixup?) See also Lars' comment here: https://issues.apache.org/jira/browse/HBASE-5058?focusedCommentId=13173364&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13173364 Rewrite our client (client 2.0) --- Key: HBASE-3523 URL: https://issues.apache.org/jira/browse/HBASE-3523 Project: HBase Issue Type: Brainstorming Reporter: stack Is it just me or do others sense that there is pressure building to redo the client? If just me, ignore the below... I'll just keep notes in here. Otherwise, what would the requirements for a client rewrite look like? + Let out InterruptedException + Enveloping of messages or space for metadata that can be passed by client to server and by server to client; e.g. the region a.b.c moved to server x.y.z. or scanner is finished or timeout + A different RPC? One with tighter serialization. + More sane timeout/retry policy. Does it have to support async communication? Do callbacks? What else?
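Two items on the wish list above — "let out InterruptedException" and a "more sane timeout/retry policy" — combine naturally into one helper. A speculative sketch (not any actual HBase client code; names are invented):

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of a saner retry policy: bounded attempts with
// exponential backoff, and InterruptedException let out to the caller
// instead of being wrapped or swallowed.
class Retrier {
    static <T> T withRetries(Callable<T> op, int maxAttempts, long baseBackoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException ie) {
                throw ie; // let interruption out, per the wish list
            } catch (Exception e) {
                last = e;
                Thread.sleep(baseBackoffMs << attempt); // 1x, 2x, 4x, ...
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("maxAttempts must be > 0");
        }
        throw last;
    }
}
```

A transient failure is retried with growing backoff; a persistent one surfaces as the last exception after a bounded number of attempts, rather than blocking forever.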
[jira] [Commented] (HBASE-5058) Allow HBaseAdmin to use an existing connection
[ https://issues.apache.org/jira/browse/HBASE-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173908#comment-13173908 ] stack commented on HBASE-5058: -- @Lars We already have an issue, hbase-3523 Rewrite our client (client 2.0) Allow HBaseAdmin to use an existing connection - Key: HBASE-5058 URL: https://issues.apache.org/jira/browse/HBASE-5058 Project: HBase Issue Type: Sub-task Components: client Affects Versions: 0.94.0 Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Minor Fix For: 0.94.0 Attachments: 5058-v2.txt, 5058-v3.txt, 5058-v3.txt, 5058.txt What HBASE-4805 does for HTables, this should do for HBaseAdmin. Along with this, the shared error handling and retrying between HBaseAdmin and HConnectionManager can also be improved. I'll attach a first-pass patch soon.
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173914#comment-13173914 ] Mubarak Seyed commented on HBASE-4720: -- Thanks Stack. bq. Suggest you not do stuff like below in future because it bloats your patch making it more susceptible to rot and besides is not related directly to what you are trying to fix so distracting for reviewers (for the next time): I had formatted the code using [-HBASE-3678-|https://issues.apache.org/jira/browse/HBASE-3678] {{eclipse_formatter_apache.xml}}. I apologize for messing up the formatting. Can you please advise on code formatting? (I believe the hbase book also refers to HBASE-3678 for code formatting.) bq. Is this safe? e.g. what if row has binary characters in it? Should these be base64'd or something? Other test methods in {{src/test/java/org/apache/hadoop/hbase/rest/TestRowResources.java}} use the same way to build the URI ({{deleteRow, deleteValue, putValuePB,}} etc). I just copied the code from other methods. bq. You have some lines that are way too long. {{eclipse-code-formatter.xml}} uses a line length of 80; please advise on line length. {code} <setting id="org.eclipse.jdt.core.formatter.comment.line_length" value="80"/> {code} Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. 
This works quite well for the regular hbase client, but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client.
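The race the reporter describes is exactly what checkAndPut closes: the mutation is applied only if the current cell value still matches an expected value, as a single atomic step. A self-contained, hypothetical model of that semantic over a plain map (not HBase's or the REST gateway's actual code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of checkAndPut semantics: compare the current value
// against an expected value and apply the update only on a match, all
// under one lock so concurrent writers cannot interleave.
class SentinelTable {
    private final Map<String, byte[]> cells = new HashMap<>();

    synchronized boolean checkAndPut(String key, byte[] expected, byte[] update) {
        byte[] current = cells.get(key);
        boolean matches = (expected == null)
                ? current == null                 // expect the cell to be absent
                : Arrays.equals(expected, current);
        if (matches) {
            cells.put(key, update);
        }
        return matches;                           // false => caller lost the race
    }

    synchronized byte[] get(String key) { return cells.get(key); }
}
```

A REST round-trip offering the same operation would let the sentinel-table pattern work without the client-side workarounds the report mentions.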
[jira] [Commented] (HBASE-4956) Control direct memory buffer consumption by HBaseClient
[ https://issues.apache.org/jira/browse/HBASE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173917#comment-13173917 ] Lars Hofhansl commented on HBASE-4956: -- Just checked the HTable code in trunk... Puts (even single put operations) already go through a thread from the HTable's thread pool. A single get request issues the request directly, but get(List<Get>) is also using the pool. So by limiting the number of threads in a single thread pool and using the new HTable constructor from HBASE-4805, this can be controlled. One needs to make sure to use only get(List<Get>) in addition to a relatively small global threadpool. At Salesforce we'll shoot for a pool with ~100 threads and a waiting queue of about 100 as well (for very large clusters that may be too small though). Control direct memory buffer consumption by HBaseClient --- Key: HBASE-4956 URL: https://issues.apache.org/jira/browse/HBASE-4956 Project: HBase Issue Type: New Feature Reporter: Ted Yu As Jonathan explained here https://groups.google.com/group/asynchbase/browse_thread/thread/c45bc7ba788b2357?pli=1 , the standard hbase client inadvertently consumes a large amount of direct memory. We should consider using netty for NIO-related tasks. -- This message is automatically generated by JIRA.
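The pool Lars sketches (~100 threads, waiting queue of ~100) caps direct-buffer usage because each thread accounts for one off-heap socket buffer at most. A minimal construction of such a bounded pool with standard java.util.concurrent pieces (the sizing numbers are from the comment; the caller-runs rejection policy is one plausible choice for back-pressure, not something the comment specifies):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the bounded client pool described above: a fixed
// number of threads, a bounded waiting queue, and caller-runs back-pressure
// once both fill up, so thread count (and thus direct memory) stays capped.
class BoundedClientPool {
    static ThreadPoolExecutor create(int threads, int queueDepth) {
        return new ThreadPoolExecutor(
                threads, threads,              // fixed-size pool
                60L, TimeUnit.SECONDS,         // idle keep-alive (unused at fixed size)
                new ArrayBlockingQueue<>(queueDepth),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

This pool would be passed to the new HTable constructor from HBASE-4805 so that all HTable instances share it.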
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173933#comment-13173933 ] Hudson commented on HBASE-5077: --- Integrated in HBase-0.92 #205 (See [https://builds.apache.org/job/HBase-0.92/205/]) HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS stack : Files : * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java stack : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.0 Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch I hope I didn't break spacetime continuum, I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173934#comment-13173934 ] Hudson commented on HBASE-5078: --- Integrated in HBase-0.92 #205 (See [https://builds.apache.org/job/HBase-0.92/205/]) HBASE-5078 DistributedLogSplitter failing to split file because it has edits for lots of regions stack : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078-v3.txt, 5078-v4.txt, 5078.txt Testing 0.92.0RC, ran into interesting issue where a log file had edits for many regions and just opening the file per region was taking so long, we were never updating our progress and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds its supposed to take acquiring the task. First, here is master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 
2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits.