[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173982#comment-13173982 ] Hudson commented on HBASE-5078: --- Integrated in HBase-0.92-security #47 (See [https://builds.apache.org/job/HBase-0.92-security/47/]) HBASE-5078 DistributedLogSplitter failing to split file because it has edits for lots of regions stack : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078-v3.txt, 5078-v4.txt, 5078.txt Testing 0.92.0RC, ran into interesting issue where a log file had edits for many regions and just opening the file per region was taking so long, we were never updating our progress and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds its supposed to take acquiring the task. First, here is master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.ed its/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When above happened, we'd only processed 40 edits. As written, we only heatbeat every 1024 edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173981#comment-13173981 ] Hudson commented on HBASE-5077: --- Integrated in HBase-0.92-security #47 (See [https://builds.apache.org/job/HBase-0.92-security/47/]) HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS stack : Files : * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java stack : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.0 Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch I hope I didn't break spacetime continuum, I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173984#comment-13173984 ] nkeywal commented on HBASE-5064: #554 (No //, min number of threads) Total time: 1:55:37.351s Tests run: 781, Failures: 3, Errors: 2, Skipped: 9 Too many open files TestInstantSchemaChange.testInstantSchemaJanitor NumberFormatException TestTableMapReduce.testMultiRegionTable TestHFileOutputFormat.testMRIncrementalLoad TestHFileOutputFormat.testMRIncrementalLoadWithSplit TestHFileOutputFormat.testExcludeMinorCompaction Hung TestReplication use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Open (was: Patch Available) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Attachment: 5064.v6.patch use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-5082) ColumnAggregationEndPoint- Causes null pointer in RS when we pass null column qualifier
ColumnAggregationEndPoint- Causes null pointer in RS when we pass null column qualifier --- Key: HBASE-5082 URL: https://issues.apache.org/jira/browse/HBASE-5082 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Priority: Minor I was trying to use the ColumnAggregationEndPoint.sum(). In my sample is just created a column family but did not use any qualifier and inserted some data. I tried to use ColumnAggregationEndPoint.sum(qualifier, null). When i did this inside the ColumnAggregationEndPoint we do scan.addColumn(). This is adding the [null] array in the scan object. Later in the scanQueryMatcher it is throwing nullpointer exception. I can understand that addColumn() is to specifiy the qualifier. Do we need to document somewhere saying qualifier should not be null? I think coprocessors can be used even in places where we don't have qualifiers. If that is the case this sample ColumnAggregationEndPoint may not work. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4698) Let the HFile Pretty Printer print all the key values for a specific row.
[ https://issues.apache.org/jira/browse/HBASE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174036#comment-13174036 ] Hudson commented on HBASE-4698: --- Integrated in HBase-TRUNK-security #39 (See [https://builds.apache.org/job/HBase-TRUNK-security/39/]) HBASE-4698 Let the HFile Pretty Printer print all the key values for a specific row; REMOVE OVER-COMMIT, STUFF NOT IN 4698 PATCH HBASE-4698 Let the HFile Pretty Printer print all the key values for a specific row stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java Let the HFile Pretty Printer print all the key values for a specific row. - Key: HBASE-4698 URL: https://issues.apache.org/jira/browse/HBASE-4698 Project: HBase Issue Type: New Feature Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.94.0 Attachments: D111.1.patch, D111.1.patch, D111.1.patch, D111.2.patch, D111.3.patch, D111.4.patch, HBASE-4689-trunk.patch When using HFile Pretty Printer to debug HBase issues, it would very nice to allow the Pretty Printer to seek to a specific row, and only print all the key values for this row. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174037#comment-13174037 ] Hudson commented on HBASE-5072: --- Integrated in HBase-TRUNK-security #39 (See [https://builds.apache.org/job/HBase-TRUNK-security/39/]) [jira] [HBASE-5072] Support Max Value for Per-Store Metrics Summary: We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. Test Plan: - mvn test -Dtest=TestRegionServerMetrics Reviewers: JIRA, mbautin, Kannan Reviewed By: Kannan CC: stack, nspiegelberg, mbautin, Kannan Differential Revision: 945 nspiegelberg : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaMetrics.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch, HBASE-5072.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174035#comment-13174035 ] Hudson commented on HBASE-5077: --- Integrated in HBase-TRUNK-security #39 (See [https://builds.apache.org/job/HBase-TRUNK-security/39/]) HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS -- fix compile error HBASE-5077 SplitLogWorker fails to let go of a task, kills the RS stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java stack : Files : * /hbase/trunk/CHANGES.txt * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.0 Attachments: HBASE-5077-v2.patch, HBASE-5077-v4.txt, HBASE-5077.patch I hope I didn't break spacetime continuum, I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174039#comment-13174039 ] Hudson commented on HBASE-5078: --- Integrated in HBase-TRUNK-security #39 (See [https://builds.apache.org/job/HBase-TRUNK-security/39/]) HBASE-5078 DistributedLogSplitter failing to split file because it has edits for lots of regions HBASE-5078 DistributedLogSplitter failing to split file because it has edits for lots of regions stack : Files : * /hbase/trunk/CHANGES.txt stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078-v3.txt, 5078-v4.txt, 5078.txt Testing 0.92.0RC, ran into interesting issue where a log file had edits for many regions and just opening the file per region was taking so long, we were never updating our progress and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds its supposed to take acquiring the task. First, here is master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.ed its/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When above happened, we'd only processed 40 edits. As written, we only heatbeat every 1024 edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further
[ https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174077#comment-13174077 ] ramkrishna.s.vasudevan commented on HBASE-5009: --- @Ted Not sure how long it may take.. but it should not be too long.. But this will ensure that the background threads are completed. If not we may get more such errors {code} org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/Htable_UFDR_015/914f3eff1aa7649b7a99b8ce86f3f39c/splits/42423e22baa0c02897a27f92a265f2ad/value/1480356236187113033.914f3eff1aa7649b7a99b8ce86f3f39c File does not exist. [Lease. Holder: DFSClient_hb_rs_193-195-18-165,20020,1324435953313_1324435953584_377812698_27, pendingcreates: 1] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1786) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1777) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1834) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1820) at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:738) at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:547) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1116) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1112) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1110) at org.apache.hadoop.ipc.Client.call(Client.java:954) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:245) at $Proxy6.complete(Unknown Source) at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invokeMethod(RPCRetryAndSwitchInvoker.java:183) at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invokeMethod(RPCRetryAndSwitchInvoker.java:171) at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:76) at $Proxy6.complete(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:4307) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:4211) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:63) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:88) at org.apache.hadoop.hbase.io.Reference.write(Reference.java:133) at org.apache.hadoop.hbase.regionserver.StoreFile.split(StoreFile.java:651) at org.apache.hadoop.hbase.regionserver.SplitTransaction.splitStoreFile(SplitTransaction.java:498) {code} Failure of creating split dir if it already exists prevents splits from happening further - Key: HBASE-5009 URL: https://issues.apache.org/jira/browse/HBASE-5009 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Attachments: HBASE-5009.patch, HBASE-5009_Branch90.patch The scenario is - The split of a region takes a long time - The deletion of the splitDir fails due to HDFS problems. - Subsequent splits also fail after that. {code} private static void createSplitDir(final FileSystem fs, final Path splitdir) throws IOException { if (fs.exists(splitdir)) throw new IOException(Splitdir already exits? + splitdir); if (!fs.mkdirs(splitdir)) throw new IOException(Failed create of + splitdir); } {code} Correct me if am wrong? If it is an issue can we change the behaviour of throwing exception? Pls suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174080#comment-13174080 ] Hadoop QA commented on HBASE-5064: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508218/5064.v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestInstantSchemaChange org.apache.hadoop.hbase.regionserver.wal.TestLogRolling org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/565//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/565//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/565//console This message is automatically generated. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Open (was: Patch Available) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Attachment: 5064.v6.patch use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174089#comment-13174089 ] nkeywal commented on HBASE-5064: #565 (No //, min number of threads) Total time: 1:59:24.242s Tests run: 788, Failures: 3, Errors: 3, Skipped: 9 Too many open files TestInstantSchemaChange.testInstantSchemaJanitor NumberFormatException TestTableMapReduce.testMultiRegionTable TestHFileOutputFormat.testMRIncrementalLoad TestHFileOutputFormat.testMRIncrementalLoadWithSplit TestHFileOutputFormat.testExcludeMinorCompaction Hung None it seems, and at it least not TestReplication The directory is already locked. TestLogRolling.testLogRollOnDatanodeDeath use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Patch Available (was: Open) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4698) Let the HFile Pretty Printer print all the key values for a specific row.
[ https://issues.apache.org/jira/browse/HBASE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174116#comment-13174116 ] Hudson commented on HBASE-4698: --- Integrated in HBase-TRUNK #2566 (See [https://builds.apache.org/job/HBase-TRUNK/2566/]) HBASE-4698 Let the HFile Pretty Printer print all the key values for a specific row; REMOVE OVER-COMMIT, STUFF NOT IN 4698 PATCH HBASE-4698 Let the HFile Pretty Printer print all the key values for a specific row stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java Let the HFile Pretty Printer print all the key values for a specific row. - Key: HBASE-4698 URL: https://issues.apache.org/jira/browse/HBASE-4698 Project: HBase Issue Type: New Feature Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.94.0 Attachments: D111.1.patch, D111.1.patch, D111.1.patch, D111.2.patch, D111.3.patch, D111.4.patch, HBASE-4689-trunk.patch When using HFile Pretty Printer to debug HBase issues, it would very nice to allow the Pretty Printer to seek to a specific row, and only print all the key values for this row. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174131#comment-13174131 ] stack commented on HBASE-5064: -- So above tests are w/o //? (I'm getting no response from Giri) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further
[ https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5009: -- Status: Patch Available (was: Open) Failure of creating split dir if it already exists prevents splits from happening further - Key: HBASE-5009 URL: https://issues.apache.org/jira/browse/HBASE-5009 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Attachments: HBASE-5009.patch, HBASE-5009_Branch90.patch The scenario is - The split of a region takes a long time - The deletion of the splitDir fails due to HDFS problems. - Subsequent splits also fail after that. {code} private static void createSplitDir(final FileSystem fs, final Path splitdir) throws IOException { if (fs.exists(splitdir)) throw new IOException(Splitdir already exits? + splitdir); if (!fs.mkdirs(splitdir)) throw new IOException(Failed create of + splitdir); } {code} Correct me if am wrong? If it is an issue can we change the behaviour of throwing exception? Pls suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further
[ https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174164#comment-13174164 ] Hadoop QA commented on HBASE-5009: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12507675/HBASE-5009.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 75 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/567//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/567//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/567//console This message is automatically generated. Failure of creating split dir if it already exists prevents splits from happening further - Key: HBASE-5009 URL: https://issues.apache.org/jira/browse/HBASE-5009 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Attachments: HBASE-5009.patch, HBASE-5009_Branch90.patch The scenario is - The split of a region takes a long time - The deletion of the splitDir fails due to HDFS problems. - Subsequent splits also fail after that. {code} private static void createSplitDir(final FileSystem fs, final Path splitdir) throws IOException { if (fs.exists(splitdir)) throw new IOException(Splitdir already exits? + splitdir); if (!fs.mkdirs(splitdir)) throw new IOException(Failed create of + splitdir); } {code} Correct me if am wrong? If it is an issue can we change the behaviour of throwing exception? Pls suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4951) master process can not be stopped when it is initializing
[ https://issues.apache.org/jira/browse/HBASE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-4951: -- Attachment: HBASE-4951_branch.patch Patch for branch0.90 master process can not be stopped when it is initializing - Key: HBASE-4951 URL: https://issues.apache.org/jira/browse/HBASE-4951 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: xufeng Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.90.6 Attachments: HBASE-4951.patch, HBASE-4951_branch.patch It is easy to reproduce by following step: step1:start master process.(do not start regionserver process in the cluster). the master will wait the regionserver to check in: org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin step2:stop the master by sh command bin/hbase master stop result:the master process will never die because catalogTracker.waitForRoot() method will block unitl the root region assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4951) master process can not be stopped when it is initializing
[ https://issues.apache.org/jira/browse/HBASE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174170#comment-13174170 ] Hadoop QA commented on HBASE-4951: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508251/HBASE-4951_branch.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/568//console This message is automatically generated. master process can not be stopped when it is initializing - Key: HBASE-4951 URL: https://issues.apache.org/jira/browse/HBASE-4951 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: xufeng Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.90.6 Attachments: HBASE-4951.patch, HBASE-4951_branch.patch It is easy to reproduce by following step: step1:start master process.(do not start regionserver process in the cluster). the master will wait the regionserver to check in: org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin step2:stop the master by sh command bin/hbase master stop result:the master process will never die because catalogTracker.waitForRoot() method will block unitl the root region assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174180#comment-13174180 ] Jimmy Xiang commented on HBASE-5081: I am working on a path now. I think synchronous deleteNode is clean. It will give retry a fresh start. But it may take a while if there are too many files. Yes, for long term, we can think about how to do what stack says. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174187#comment-13174187 ] Jimmy Xiang commented on HBASE-5081: Can we deleteNode only if it is successfully done? If it is not completed, let the node stay there. In this case, when the retry happens, it should see the old node there, but it is ok. The new task in the hashmap won't be deleted either. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Status: Patch Available (was: Open) This is a patch for 92. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174208#comment-13174208 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ --- Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary --- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081. https://issues.apache.org/jira/browse/HBASE-5081 Diffs - src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing --- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174207#comment-13174207 ] nkeywal commented on HBASE-5064: #566 (No //, min number of threads) Total time: 1:56:55.198s Tests run: 789, Failures: 3, Errors: 2, Skipped: 9 Too many open files TestInstantSchemaChange.testInstantSchemaJanitor NumberFormatException TestTableMapReduce.testMultiRegionTable TestHFileOutputFormat.testMRIncrementalLoad TestHFileOutputFormat.testMRIncrementalLoadWithSplit TestHFileOutputFormat.testExcludeMinorCompaction Hung None use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174209#comment-13174209 ] Zhihong Yu commented on HBASE-4720: --- I posted comment on review board. @Andrew: Can you take a look at the latest patch ? Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174214#comment-13174214 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ --- (Updated 2011-12-21 17:01:22.901024) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes --- Updated the comments. Summary --- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081. https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing --- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4916) LoadTest MR Job
[ https://issues.apache.org/jira/browse/HBASE-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174218#comment-13174218 ] Phabricator commented on HBASE-4916: stack has commented on the revision HBASE-4916 [jira] LoadTest MR Job. Looking back at PE, here is what it has that this tool doesn't and in some cases might be hard to add: + Runs against minicluster (Should be easy to add) + Can sequential read/writes (Not so important, though sequential write can be interesting) + Can do scan tests; random 10 items, random 100, etc. Stuff this tool can do that would need to be added to PE: + PE needs to do more work per mapper; at moment doing dumb stuff like running 24 mappers per TT just to get the load up + Better load profiles; PE loading is regular REVISION DETAIL https://reviews.facebook.net/D741 LoadTest MR Job --- Key: HBASE-4916 URL: https://issues.apache.org/jira/browse/HBASE-4916 Project: HBase Issue Type: Sub-task Components: client, regionserver Reporter: Nicolas Spiegelberg Assignee: Christopher Gist Fix For: 0.94.0 Attachments: HBASE-4916.D741.1.patch Add a script to start a streaming map-reduce job where each map tasks runs an instance of the load tester for a partition of the key-space. Ensure that the load tester takes a parameter indicating the start key for write operations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Attachment: patch_for_92_v2.txt Updated the comments. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Patch Available (was: Open) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v8.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Attachment: 5064.v8.patch use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v8.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Open (was: Patch Available) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v8.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174223#comment-13174223 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4046 --- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9151 White space on end and 'is' should be 'if' - Michael On 2011-12-21 17:01:22, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 17:01:22) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174224#comment-13174224 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 --- Ship it! Looks good to me. How I know it works? Would it be hard to write a test? - Michael On 2011-12-21 17:01:22, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 17:01:22) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174225#comment-13174225 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- bq. On 2011-12-21 17:06:22, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 358 bq. https://reviews.apache.org/r/3292/diff/2/?file=65660#file65660line358 bq. bq. White space on end and 'is' should be 'if' Let me fix it. - Jimmy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4046 --- On 2011-12-21 17:01:22, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 17:01:22) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174228#comment-13174228 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ --- (Updated 2011-12-21 17:12:36.185307) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes --- Fixed the comments. Summary --- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081. https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing --- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Attachment: patch_for_92_v3.txt Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174234#comment-13174234 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4050 --- Interesting idea. Let us know result of testing in real cluster. Since the zk node may live across cluster restart, we should verify the change works in that scenario. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9158 I think the new comments should be lifted to line 353 for clarity. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9156 Should read 'if the task failed' src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9157 Indentation should be 2 spaces. - Ted On 2011-12-21 17:12:36, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 17:12:36) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174236#comment-13174236 ] Zhihong Yu commented on HBASE-4720: --- @Mubarak: I think the rule you posted only governs comment. I tried auto-formatting a long line which has 3 identical short statements. There was no auto-wrapping. Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4916) LoadTest MR Job
[ https://issues.apache.org/jira/browse/HBASE-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174249#comment-13174249 ] Phabricator commented on HBASE-4916: tedyu has commented on the revision HBASE-4916 [jira] LoadTest MR Job. Nice tool. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/loadtest/GetGenerator.java:81 Should be seeded. src/main/java/org/apache/hadoop/hbase/loadtest/RandomDataGenerator.java:94 Please seed as what is done at line 61. src/main/java/org/apache/hadoop/hbase/loadtest/KeyCounter.java:103 I think we should add some message to NoKeysException so that this scenario can be distinguished from that of line 90. REVISION DETAIL https://reviews.facebook.net/D741 LoadTest MR Job --- Key: HBASE-4916 URL: https://issues.apache.org/jira/browse/HBASE-4916 Project: HBase Issue Type: Sub-task Components: client, regionserver Reporter: Nicolas Spiegelberg Assignee: Christopher Gist Fix For: 0.94.0 Attachments: HBASE-4916.D741.1.patch Add a script to start a streaming map-reduce job where each map tasks runs an instance of the load tester for a partition of the key-space. Ensure that the load tester takes a parameter indicating the start key for write operations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174250#comment-13174250 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ --- (Updated 2011-12-21 17:40:59.028814) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes --- Fixed the indentation issue, added more comment. Summary --- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081. https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing --- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174251#comment-13174251 ] Jimmy Xiang commented on HBASE-5081: The patch is for both 0.92 and 0.94 actually. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174252#comment-13174252 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4052 --- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9159 Should read 'Asynchronously deleting' src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9160 Should read 'if the task failed and was not an orphan' - Ted On 2011-12-21 17:40:59, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 17:40:59) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174254#comment-13174254 ] Zhihong Yu commented on HBASE-5064: --- Build #566 looks good. Shall we increase parallelism and run test suite again ? use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v8.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4951) master process can not be stopped when it is initializing
[ https://issues.apache.org/jira/browse/HBASE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-4951: -- Comment: was deleted (was: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508251/HBASE-4951_branch.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/568//console This message is automatically generated.) master process can not be stopped when it is initializing - Key: HBASE-4951 URL: https://issues.apache.org/jira/browse/HBASE-4951 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: xufeng Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.90.6 Attachments: HBASE-4951.patch, HBASE-4951_branch.patch It is easy to reproduce by following step: step1:start master process.(do not start regionserver process in the cluster). the master will wait the regionserver to check in: org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin step2:stop the master by sh command bin/hbase master stop result:the master process will never die because catalogTracker.waitForRoot() method will block unitl the root region assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-5083) Backup HMaster should have http infoport open with link to the active master
Backup HMaster should have http infoport open with link to the active master Key: HBASE-5083 URL: https://issues.apache.org/jira/browse/HBASE-5083 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.92.0 Reporter: Jonathan Hsieh Without ssh'ing and jps/ps'ing, it is difficult to see if a backup hmaster is up. It seems like it would be good for a backup hmaster to have a basic web page up on the info port so that users could see that it is up. Also it should probably either provide a link to the active master. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174259#comment-13174259 ] Hadoop QA commented on HBASE-5081: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508264/patch_for_92_v2.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication org.apache.hadoop.hbase.replication.TestMultiSlaveReplication org.apache.hadoop.hbase.replication.TestMasterReplication Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/570//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/570//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/570//console This message is automatically generated. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4951) master process can not be stopped when it is initializing
[ https://issues.apache.org/jira/browse/HBASE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174260#comment-13174260 ] Zhihong Yu commented on HBASE-4951: --- Minor comments: {code} +if (LOG.isTraceEnabled()) { + LOG.info(.META. still not available, sleeping and retrying. + {code} Should LOG.isInfoEnabled() be called above ? The following two comments apply to ZooKeeperNodeTracker.java and CatalogTracker.java: {code} +// chk if the shutdown node is available. Because incase of cluster +// shutdonw +// this loop prevents the master from going down. {code} The above should read: check if the shutdown node exists. In case of cluster shutdown, ... {code} + Unexpected exception handling while checking if shutdown node exists., {code} Extra word above: handling master process can not be stopped when it is initializing - Key: HBASE-4951 URL: https://issues.apache.org/jira/browse/HBASE-4951 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: xufeng Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.90.6 Attachments: HBASE-4951.patch, HBASE-4951_branch.patch It is easy to reproduce by following step: step1:start master process.(do not start regionserver process in the cluster). the master will wait the regionserver to check in: org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin step2:stop the master by sh command bin/hbase master stop result:the master process will never die because catalogTracker.waitForRoot() method will block unitl the root region assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Attachment: hbase-5081_patch_v5.txt Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174261#comment-13174261 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ --- (Updated 2011-12-21 18:06:55.460515) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes --- Changed some comments Summary --- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081. https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing --- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174262#comment-13174262 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- bq. On 2011-12-21 17:44:53, Ted Yu wrote: bq. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 368 bq. https://reviews.apache.org/r/3292/diff/4/?file=65674#file65674line368 bq. bq. Should read 'if the task failed and was not an orphan' The original is fine. - Jimmy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4052 --- On 2011-12-21 18:06:55, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 18:06:55) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174271#comment-13174271 ] Phabricator commented on HBASE-5021: Kannan has commented on the revision [jira] [HBase-5021] Enforce upper bound on timestamp. Liyin reported that the TestHeapSize didn't pass in the daily builds. REVISION DETAIL https://reviews.facebook.net/D849 COMMIT https://reviews.facebook.net/rHBASE1221532 Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: 5021-addendum.txt, D849.1.patch, D849.2.patch, D849.3.patch, HBASE-5021-trunk.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted by the timestamp. We are working on adding proper checks to app server, but production performance could be severely impacted with significant recovery time if something slips past. Since timestamps are considered a fundamental part of the HBase schema multiple optimizations use timestamp information, we should allow the option to sanity check the upper bound on the server-side in HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174277#comment-13174277 ] nkeywal commented on HBASE-5064: No, we need the hadoopqa setting increased before activating //. Here, I am launching multiple tests to identify most of the random issues we can have, this will help to distinguish what is random vs. what is due to //. However, I can split this patch in smaller ones: 1) one to activate the lastest version of surefire 2) one for the minimal thread settings 3) one for the // itself. From all the tests done so far, 1 2 seem ok today. On Wed, Dec 21, 2011 at 6:49 PM, Zhihong Yu (Commented) (JIRA) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v8.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174313#comment-13174313 ] Hadoop QA commented on HBASE-5081: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508259/patch_for_92.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestInstantSchemaChange org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.master.TestSplitLogManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/569//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/569//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/569//console This message is automatically generated. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174324#comment-13174324 ] Zhihong Yu commented on HBASE-5081: --- I can reproduce the following test failures on MacBook: {code} Failed tests: testTaskErr(org.apache.hadoop.hbase.master.TestSplitLogManager) testUnassignedTimeout(org.apache.hadoop.hbase.master.TestSplitLogManager) {code} Please adjust TestSplitLogManager accordingly. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5070) Constraints implementation and javadoc changes
[ https://issues.apache.org/jira/browse/HBASE-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174339#comment-13174339 ] jirapos...@reviews.apache.org commented on HBASE-5070: -- bq. On 2011-12-21 06:09:56, Michael Stack wrote: bq. This patch should go in. Good improvements. Doesn't address the bigger issues of Configuration inline w/ HTD -- have you tried it, I mean, IIRC, though you might have one custom config only, the whole Configuration will be output per Constraint? -- and adding support to shell. Those are in different issues? I'm going to try checking out the conf stuff for the Constraints today in the shell. Keep in mind that the conf stored per Constraint should just be configuration info for the Constraint, as opposed to the entire hbase conf (so we don't see all the values from hbase-site.xml, etc), but by default only see the enabled and priority configuration values. I'm expecting that to be pretty small, but I'll let you know. Also, there is HBASE-4879 to add shell support to Constraints as the original was already pretty massive (and taking a while) and the shell stuff could be cleanly separated out. I was working on it, but have temporarily diverted from it to work on the maven modularization (4336) as per Ted's request. - Jesse --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3273/#review4034 --- On 2011-12-20 19:14:46, Jesse Yates wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3273/ bq. --- bq. bq. (Updated 2011-12-20 19:14:46) bq. bq. bq. Review request for hbase, Gary Helmling, Ted Yu, and Michael Stack. bq. bq. bq. Summary bq. --- bq. bq. Follow-up on changes to constraint as per stack's comments on HBASE-4605. bq. bq. bq. This addresses bug HBASE-5070. bq. https://issues.apache.org/jira/browse/HBASE-5070 bq. bq. bq. Diffs bq. - bq. bq.src/docbkx/book.xml bd3f881 bq.src/main/java/org/apache/hadoop/hbase/constraint/BaseConstraint.java 7ce6d45 bq.src/main/java/org/apache/hadoop/hbase/constraint/Constraint.java 2d8b4d7 bq.src/main/java/org/apache/hadoop/hbase/constraint/Constraints.java 7825466 bq.src/main/java/org/apache/hadoop/hbase/constraint/package-info.java 6145ed5 bq. src/test/java/org/apache/hadoop/hbase/constraint/CheckConfigurationConstraint.java c49098d bq. bq. Diff: https://reviews.apache.org/r/3273/diff bq. bq. bq. Testing bq. --- bq. bq. mvn clean test -P localTests -Dest=*Constraint* - all tests pass. bq. bq. bq. Thanks, bq. bq. Jesse bq. bq. Constraints implementation and javadoc changes -- Key: HBASE-5070 URL: https://issues.apache.org/jira/browse/HBASE-5070 Project: HBase Issue Type: Task Reporter: Zhihong Yu This is continuation of HBASE-4605 See Stack's comments https://reviews.apache.org/r/2579/#review3980 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5033: --- Attachment: D933.4.patch Liyin updated the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. Reviewers: Kannan, mbautin, Karthik, JIRA Implement the proposal described in the previous comments. REVISION DETAIL https://reviews.facebook.net/D933 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/HConstants.java src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/Store.java src/main/java/org/apache/hadoop/hbase/util/Threads.java Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch, D933.4.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-5084) Allow different HTable instances to share one ExecutorService
Allow different HTable instances to share one ExecutorService - Key: HBASE-5084 URL: https://issues.apache.org/jira/browse/HBASE-5084 Project: HBase Issue Type: Task Reporter: Zhihong Yu This came out of Lily 1.1.1 release: Use a shared ExecutorService for all HTable instances, leading to better (or actual) thread reuse -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174345#comment-13174345 ] Mubarak Seyed commented on HBASE-4720: -- @Ted: {{eclipse_formatter_apache.xml}} does not work for long lines. Do you mind to post your eclipse formatter rule file? Thanks. Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174351#comment-13174351 ] Zhihong Yu commented on HBASE-4720: --- I used the formatter provided by Nicolas. Please adjust line length manually. Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5084) Allow different HTable instances to share one ExecutorService
[ https://issues.apache.org/jira/browse/HBASE-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174350#comment-13174350 ] Zhihong Yu commented on HBASE-5084: --- We already have: {code} public HTable(final byte[] tableName, final HConnection connection, final ExecutorService pool) throws IOException { {code} where connection cannot be null. Shall we relax the restriction on connection not being null ? Allow different HTable instances to share one ExecutorService - Key: HBASE-5084 URL: https://issues.apache.org/jira/browse/HBASE-5084 Project: HBase Issue Type: Task Reporter: Zhihong Yu This came out of Lily 1.1.1 release: Use a shared ExecutorService for all HTable instances, leading to better (or actual) thread reuse -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174357#comment-13174357 ] Jimmy Xiang commented on HBASE-5081: I am thinking about sync delete for the failure case. What do you think? I am adjusting the test failure now. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4916) LoadTest MR Job
[ https://issues.apache.org/jira/browse/HBASE-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174365#comment-13174365 ] Phabricator commented on HBASE-4916: jdcryans has commented on the revision HBASE-4916 [jira] LoadTest MR Job. I tried using it on my 0.92 cluster, had to backport HexStringSplit but no biggie, but I encountered a few usability issues. - Why is -zk mandatory? It seems that by default it should use the one provided in HBaseConfiguration. - There is no help, not even for basic usage like when I tried to use it the first time without arguments all I got was this and it doesn't tell me what the switch is: jdcryans@sv4r11s38:~/hbase$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.LoadTest ZooKeeper quorum must be specified - The workloads need their own documentation too, currently unless you dig in the code you won't know how to properly tune them. - VersionWorkloadGenerator and MixedWorkloadGenerator don't work with their default values, they both hit this in their reducers and the reason is that delayNS ends up at 0: 11/12/21 19:45:43 INFO mapred.JobClient: Task Id : attempt_201112142134_0063_r_08_2, Status : FAILED java.lang.IllegalArgumentException at java.util.concurrent.ScheduledThreadPoolExecutor.scheduleAtFixedRate(ScheduledThreadPoolExecutor.java:420) at org.apache.hadoop.hbase.loadtest.Workload$Executor.start(Workload.java:229) at org.apache.hadoop.hbase.mapreduce.LoadTest$Reduce.reduce(LoadTest.java:158) at org.apache.hadoop.hbase.mapreduce.LoadTest$Reduce.reduce(LoadTest.java:124) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1154) at org.apache.hadoop.mapred.Child.main(Child.java:264) - Finally, and I'm not sure if you can do something about it, some tasks fail in this way on my cluster: 11/12/21 19:45:31 INFO mapred.JobClient: Task Id : attempt_201112142134_0063_r_08_1, Status : FAILED java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:242) Caused by: java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229) And in the stderr: Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 8991; nested exception is: java.net.BindException: Address already in use It would be nice if it was bubbled up, but I understand if you don't have control over it. REVISION DETAIL https://reviews.facebook.net/D741 LoadTest MR Job --- Key: HBASE-4916 URL: https://issues.apache.org/jira/browse/HBASE-4916 Project: HBase Issue Type: Sub-task Components: client, regionserver Reporter: Nicolas Spiegelberg Assignee: Christopher Gist Fix For: 0.94.0 Attachments: HBASE-4916.D741.1.patch Add a script to start a streaming map-reduce job where each map tasks runs an instance of the load tester for a partition of the key-space. Ensure that the load tester takes a parameter indicating the start key for write operations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Status: Open (was: Patch Available) Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Status: Patch Available (was: Open) Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-5081: --- Attachment: hbase-5081-patch-v6.txt Fixed the unit test. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174401#comment-13174401 ] Phabricator commented on HBASE-4218: tedyu has commented on the revision [jira] [HBASE-4218] HFile data block encoding (delta encoding). INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java:42 Should read 'have been created' src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java:49 I think delta should be removed here to be consistent with new naming convention I like the javadoc in HColumnDescriptor.java @ line 601 - it is more detailed. REVISION DETAIL https://reviews.facebook.net/D447 Delta Encoding of KeyValues (aka prefix compression) - Key: HBASE-4218 URL: https://issues.apache.org/jira/browse/HBASE-4218 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jacek Migdal Assignee: Mikhail Bautin Labels: compression Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, D447.1.patch, D447.10.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, Delta_encoding_with_memstore_TS.patch, open-source.diff A compression for keys. Keys are sorted in HFile and they are usually very similar. Because of that, it is possible to design better compression than general purpose algorithms, It is an additional step designed to be used in memory. It aims to save memory in cache as well as speeding seeks within HFileBlocks. It should improve performance a lot, if key lengths are larger than value lengths. For example, it makes a lot of sense to use it when value is a counter. Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) shows that I could achieve decent level of compression: key compression ratio: 92% total compression ratio: 85% LZO on the same data: 85% LZO after delta encoding: 91% While having much better performance (20-80% faster decompression ratio than LZO). Moreover, it should allow far more efficient seeking which should improve performance a bit. It seems that a simple compression algorithms are good enough. Most of the savings are due to prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields). In order to implement it in HBase two important changes in design will be needed: -solidify interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to uncompressed buffer in HFileBlock will have bad performance -extend comparators to support comparison assuming that N first bytes are equal (or some fields are equal) Link to a discussion about something similar: http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windowssubj=Re+prefix+compression -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174402#comment-13174402 ] Hadoop QA commented on HBASE-5081: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508297/hbase-5081-patch-v6.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/573//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/573//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/573//console This message is automatically generated. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-5085) fix test-patch script from setting the ulimit
fix test-patch script from setting the ulimit - Key: HBASE-5085 URL: https://issues.apache.org/jira/browse/HBASE-5085 Project: HBase Issue Type: Bug Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan test-patch.sh script sets the ulimit -n 1024 just after triggering the patch setting this overrides the underlying systems ulimit and hence failing the hbase tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5085) fix test-patch script from setting the ulimit
[ https://issues.apache.org/jira/browse/HBASE-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated HBASE-5085: -- Attachment: hbase-5085.patch patch removes the ulimit setting in the test-patch script fix test-patch script from setting the ulimit - Key: HBASE-5085 URL: https://issues.apache.org/jira/browse/HBASE-5085 Project: HBase Issue Type: Bug Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Attachments: hbase-5085.patch test-patch.sh script sets the ulimit -n 1024 just after triggering the patch setting this overrides the underlying systems ulimit and hence failing the hbase tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174428#comment-13174428 ] Jimmy Xiang commented on HBASE-5081: Upload patch v6 to review board. +assertTrue(ZKUtil.checkExists(zkw, tasknode) != -1); The original is this: +assertTrue(ZKUtil.checkExists(zkw, tasknode) == -1); This assertion can be satisfied when there is no race without this patch. With this patch, we don't delete any failed task node. So the node should be there now. I am still working on a patch to delete the node synchronously in this scenario. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5085) fix test-patch script from setting the ulimit
[ https://issues.apache.org/jira/browse/HBASE-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174431#comment-13174431 ] Zhihong Yu commented on HBASE-5085: --- +1 on patch. Let's see what result Hadoop QA would give us based on this patch. fix test-patch script from setting the ulimit - Key: HBASE-5085 URL: https://issues.apache.org/jira/browse/HBASE-5085 Project: HBase Issue Type: Bug Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Attachments: hbase-5085.patch test-patch.sh script sets the ulimit -n 1024 just after triggering the patch setting this overrides the underlying systems ulimit and hence failing the hbase tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174433#comment-13174433 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4059 --- Test change looks good to me. - Lars On 2011-12-21 21:30:56, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 21:30:56) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f bq.src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 5c9d7dd bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5085) fix test-patch script from setting the ulimit
[ https://issues.apache.org/jira/browse/HBASE-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5085: -- Status: Patch Available (was: Open) fix test-patch script from setting the ulimit - Key: HBASE-5085 URL: https://issues.apache.org/jira/browse/HBASE-5085 Project: HBase Issue Type: Bug Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Attachments: hbase-5085.patch test-patch.sh script sets the ulimit -n 1024 just after triggering the patch setting this overrides the underlying systems ulimit and hence failing the hbase tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5081: -- Comment: was deleted (was: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508264/patch_for_92_v2.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication org.apache.hadoop.hbase.replication.TestMultiSlaveReplication org.apache.hadoop.hbase.replication.TestMasterReplication Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/570//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/570//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/570//console This message is automatically generated.) Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174435#comment-13174435 ] Hadoop QA commented on HBASE-5081: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508273/hbase-5081_patch_v5.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.coprocessor.TestCoprocessorEndpoint org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.master.TestSplitLogManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/572//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/572//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/572//console This message is automatically generated. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174438#comment-13174438 ] Zhihong Yu commented on HBASE-5081: --- Do we have a test case where zk nodes (including the failed task node) survive cluster restart and distributed log splitting successfully finishes after cluster restart ? If we use patch v6 instead of the synchronous node deletion patch (upcoming), we should check the value of resubmit_threshold. If the value is 0, we should still delete the failed task node because there would be no resubmission. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5081: -- Comment: was deleted (was: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ --- (Updated 2011-12-21 17:01:22.901024) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes --- Updated the comments. Summary --- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081. https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing --- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy ) Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5081: -- Comment: was deleted (was: bq. On 2011-12-21 17:06:22, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 358 bq. https://reviews.apache.org/r/3292/diff/2/?file=65660#file65660line358 bq. bq. White space on end and 'is' should be 'if' Let me fix it. - Jimmy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4046 --- On 2011-12-21 17:01:22, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-21 17:01:22) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. ) Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174456#comment-13174456 ] Phabricator commented on HBASE-5033: lhofhansl has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. I guess I had envisioned that we'd somehow share the same threadpool, but this works. LGTM. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/HConstants.java:193 I think we don't use camel case in these properties. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch, D933.4.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4439) Move ClientScanner out of HTable
[ https://issues.apache.org/jira/browse/HBASE-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174459#comment-13174459 ] Lars Hofhansl commented on HBASE-4439: -- I'm still interesting in committing this. I'll rebase to trunk. I'll remove the extra copy of the Scan in ClientScanner (but add it as a caveat in the javadoc instead). HTable still makes a copy of the Scan (so we can mark HBASE-1774 as a dup of this). Move ClientScanner out of HTable Key: HBASE-4439 URL: https://issues.apache.org/jira/browse/HBASE-4439 Project: HBase Issue Type: Improvement Affects Versions: 0.94.0 Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Minor Fix For: 0.94.0 Attachments: 4439-v1.txt, 4439.txt See HBASE-1935 for motivation. ClientScanner should be able to exist outside of HTable. While we're at it, we can also add an abstract client scanner to easy development of new client side scanners (such as parallel scanners, or per region scanners). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174460#comment-13174460 ] Phabricator commented on HBASE-5033: tedyu has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:807 What's the likelihood that store opening and closing activities happen at the same time ? I suggest reusing storeOpenerThreadPool here. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch, D933.4.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5081: -- Comment: was deleted (was: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508273/hbase-5081_patch_v5.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.coprocessor.TestCoprocessorEndpoint org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.master.TestSplitLogManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/572//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/572//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/572//console This message is automatically generated.) Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4916) LoadTest MR Job
[ https://issues.apache.org/jira/browse/HBASE-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174473#comment-13174473 ] Phabricator commented on HBASE-4916: jdcryans has commented on the revision HBASE-4916 [jira] LoadTest MR Job. Doing further testing, I got this while running an online merge (that took 1721 seconds to run because it counts parent regions): Exception in thread pool-2-thread-186 java.lang.NullPointerException at org.apache.hadoop.hbase.loadtest.RandomDataGenerator.verify(RandomDataGenerator.java:106) at org.apache.hadoop.hbase.loadtest.GetOperation.postAction(GetOperation.java:45) at org.apache.hadoop.hbase.loadtest.Operation.run(Operation.java:126) at org.apache.hadoop.hbase.loadtest.Workload$Executor$BoundedExecutor$1.run(Workload.java:269) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) REVISION DETAIL https://reviews.facebook.net/D741 LoadTest MR Job --- Key: HBASE-4916 URL: https://issues.apache.org/jira/browse/HBASE-4916 Project: HBase Issue Type: Sub-task Components: client, regionserver Reporter: Nicolas Spiegelberg Assignee: Christopher Gist Fix For: 0.94.0 Attachments: HBASE-4916.D741.1.patch Add a script to start a streaming map-reduce job where each map tasks runs an instance of the load tester for a partition of the key-space. Ensure that the load tester takes a parameter indicating the start key for write operations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1774) HTable$ClientScanner modifies its input parameters
[ https://issues.apache.org/jira/browse/HBASE-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174481#comment-13174481 ] Lars Hofhansl commented on HBASE-1774: -- Unfortunately making a copy is no longer possible in trunk because of ScanMetrics, which are collected on the Scan object itself. If we make a copy the copy will have its metrics populated. Hrrmmmfff. Could keep a reference to copied copied, but then the application may have to walk a chain of these references. I think we should just document this at this point. I'm in this code right now anyway (HBASE-4439)... I'll add JavaDoc for this. HTable$ClientScanner modifies its input parameters -- Key: HBASE-1774 URL: https://issues.apache.org/jira/browse/HBASE-1774 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.20.0 Reporter: Jim Kellerman Assignee: Jim Kellerman Priority: Critical HTable$ClientScanner modifies the Scan that is passed to it on construction. I would consider this to be bad programming practice because if I wanted to use the same Scan object to scan multiple tables, I would not expect one table scan to effect the other, but it does. If input parameters are going to be modified either now or later it should be called out *loudly* in the javadoc. The only way I found this behavior was by creating an application that did scan multiple tables using the same Scan object and having 'wierd stuff' happen. In my opinion, if you want to modify a field in an input parameter, you should: - make a copy of the original object - optionally return a reference to the copy. There is no javadoc about this behavior. The only thing I found was a comment in HTable$ClientScanner: {code} // HEADSUP: The scan internal start row can change as we move through table. {code} Is there a use case that requires this behavior? If so, I would recommend that ResultScanner (and the classes that implement it) provide an accessor to the mutable copy of the input Scan and leave the input argument alone. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4439) Move ClientScanner out of HTable
[ https://issues.apache.org/jira/browse/HBASE-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174486#comment-13174486 ] Lars Hofhansl commented on HBASE-4439: -- See my comment in HBASE-1774. Turns out as it stands Scan objects cannot be copied as the ScanMetrics are collected into the Scan object. Move ClientScanner out of HTable Key: HBASE-4439 URL: https://issues.apache.org/jira/browse/HBASE-4439 Project: HBase Issue Type: Improvement Affects Versions: 0.94.0 Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Minor Fix For: 0.94.0 Attachments: 4439-v1.txt, 4439.txt See HBASE-1935 for motivation. ClientScanner should be able to exist outside of HTable. While we're at it, we can also add an abstract client scanner to easy development of new client side scanners (such as parallel scanners, or per region scanners). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174517#comment-13174517 ] Mubarak Seyed commented on HBASE-4720: -- I addressed most of the review comments, will wait until get more review comments from Andrew (and others). {code} parameter update is not being used in other places as well. RowResoure.java: Response update(final CellSetModel model, final boolean replace) Response updateBinary(final byte[] message, final HttpHeaders headers, final boolean replace) ScannerResource.java: - Response update(final ScannerModel model, final boolean replace, final UriInfo uriInfo) { {code} Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4616) Update hregion encoded name to reduce logic and prevent region collisions in META
[ https://issues.apache.org/jira/browse/HBASE-4616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174518#comment-13174518 ] stack commented on HBASE-4616: -- Regards UUID, if its an md5 anyways, lets call it what it is. I tried to see if a hash that was smaller in size but it seems that even murmur hash v3 is 120 bits now. If md5, can get hash by doing 'md5sum tablename' on command line if have to; is there a uuid command-line tool so readily available? I was thinking too we could get base64 of the md5 so size in meta is less but thats probably more pain that its worth. Or, what would it take to keep a file of tablenames to int codes in the filesystem and use the zero-padded int code in .META. instead? All tables under value 10 would be catalog tables... .META. would be 0, etc. Could sit beside the hbase.version file at head of the filesystem; could call the file .hbase.tablenames.map or some such. UUID/md5 is nice in that it makes it so don't need this but they are wide compared to a short. Maybe UUID/md5 the way to go -- less complexity and the width is not really concern since we don't write tablename into keys (unless we are replicating I suppose; hmm... maybe UUID/md5 is safer because two clusters could have same code for different table names -- that'd get really confusing... if you stay w/ md5s, you have the same behavior we have now regards cross-dc and table names). I think we need to make a list of pros/cons removing tablename from meta still. In src/main/java/org/apache/hadoop/hbase/HRegionInfo.java why we import Using MetaReader inside in HRI is not on I'd say. Thats a base hbase class using a subpackage class that does client operations... HRI should be usable w/o need of a cluster/network. Ditto CatalogTracker. Dont do this.. Expand it: +import org.apache.hadoop.hbase.util.*; How is this new format going t play w/ old format? We have to convert all the old formats as part of migration into new format? What about the fact that the dir in the filesystem that holds region data is named for a hash of the HRI name? We can't rename all dirs in fs as part of migration? Or you think we should? You could add some javadoc here: * New region name format: - *lt;tablename,,lt;startkey,lt;regionIdTimestamp.lt;encodedName. + *lt;tablename,,lt;endkey,lt;regionIdTimestamp.lt;encodedName. * where, *lt;encodedName is a hex version of the MD5 hash of - *lt;tablename,lt;startkey,lt;regionIdTimestamp + *lt;tablename,lt;endkey,lt;regionIdTimestamp Isn't tablename the md5 of tablename now? What is this comment about? + // It should say, the tablename encoded in the region ends with 0x01, + // but the last region's tablename ends with 0x02 + public static final int END_OF_TABLE_NAME = 1; + public static final int END_OF_TABLE_NAME_FOR_EMPTY_ENDKEY = END_OF_TABLE_NAME + 1; Do you think these should be printable characters? Else, they'll show in terminal as a box or something? For example, if this end-of-tablename marker were a '.', and the last table had a ':', that'd sort properly and it wouldn't look so bad when you listed regions (it would be confusing having '.' for every region name but the last but less confusing that two unprintables that rendor one way for 1 and another for 2). Or you could have comma as end char and a period for the last region in a table (that makes a bit more 'sense') Lines like this are too long: +this.regionName = createRegionName(this.tableName, startKey, endKey, Long.toString(regionId).getBytes(), true); Fix your formatting in places like below: + boolean newFormat){ + // We need to be able to add a single byte easily to the regionName as we are building it up + // It's being used by appending delimiters and special markers. + +byte [] oneByte = new byte[1]; +//This cannot return any weird string chars as all uuid chars are a hex char. The first comment is too indented. There is a space between close paren and open of curly brace on function, etc. Use Bytes.toBytes instead of below if only for fact that its shorter statement: +byte[] uuidTableName = UUID.nameUUIDFromBytes(tableName).toString().getBytes(); If null, throw exception? Don't keep going? +int allocation = uuidTableName == null ? 2 : uuidTableName.length + 2;$ Explain why you are adding '2' in a comment? Why write +allocation += endKey == null ? 1 : endKey.length + 1; so? Why not as if (endKey == null) allocation += endKey.length; and then allocation += 1; // COMMENT EXPLAINING WHY THE +1? Yeah, if some of these items are null like id or tablename, it should be error? This don't seem necessary, the oneByte that is: + oneByte[0] = END_OF_TABLE_NAME_FOR_EMPTY_ENDKEY;$ Just put the END_OF_TABLE_NAME_FOR_EMPTY_ENDKEY into the byteArrayDataOutput.put(oneByte); Ditto
[jira] [Created] (HBASE-5086) Reopening a region on the a RS can leave it in PENDING_OPEN
Reopening a region on the a RS can leave it in PENDING_OPEN --- Key: HBASE-5086 URL: https://issues.apache.org/jira/browse/HBASE-5086 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Fix For: 0.92.1 I got this twice during the same test. If the region servers are slow enough and you run an online alter, it's possible for the RS to change the znode status to CLOSED and have the master send an OPEN before the region server is able to remove the region from it's list of RITs. This is what the master sees: {quote} 011-12-21 22:24:09,498 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. (offlining) 2011-12-21 22:24:09,498 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db033f7 Creating unassigned node for 43123e2e3fc83ec25fe2a76b4f09077f in a CLOSING state 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r25s44,62023,1324494325099 for region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r25s44,62023,1324494325099, region=43123e2e3fc83ec25fe2a76b4f09077f 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 43123e2e3fc83ec25fe2a76b4f09077f 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. state=CLOSED, ts=1324506255629, server=sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db033f7 Creating (or updating) unassigned node for 43123e2e3fc83ec25fe2a76b4f09077f with OFFLINE state 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. destination server is + sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f.; plan=hri=test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., src=, dest=sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. to sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,664 ERROR org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment in: sv4r25s44,62023,1324494325099 due to org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:OPEN for the region:test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. ,which we are already trying to CLOSE. {quote} After that the master abandons. And the region server: {quote} 2011-12-21 22:24:09,523 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received close region: test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,523 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f.: disabling compactions flushes 2011-12-21 22:24:09,524 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., current region memstore size 40.5m 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., commencing wait for mvcc, flushsize=42482936 2011-12-21 22:24:13,368 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://sv4r11s38:9100/hbase/test1/43123e2e3fc83ec25fe2a76b4f09077f/.tmp/87d6944c54c7417e9a34a9f9542bcb72 to hdfs://sv4r11s38:9100/hbase/test1/43123e2e3fc83ec25fe2a76b4f09077f/actions/87d6944c54c7417e9a34a9f9542bcb72 2011-12-21 22:24:13,568 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://sv4r11s38:9100/hbase/test1/43123e2e3fc83ec25fe2a76b4f09077f/actions/87d6944c54c7417e9a34a9f9542bcb72, entries=54209, sequenceid=31451012, memsize=40.5m, filesize=31.4m 2011-12-21 22:24:14,381 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
[jira] [Updated] (HBASE-5086) Reopening a region on a RS can leave it in PENDING_OPEN
[ https://issues.apache.org/jira/browse/HBASE-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-5086: -- Summary: Reopening a region on a RS can leave it in PENDING_OPEN (was: Reopening a region on the a RS can leave it in PENDING_OPEN) Reopening a region on a RS can leave it in PENDING_OPEN --- Key: HBASE-5086 URL: https://issues.apache.org/jira/browse/HBASE-5086 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Fix For: 0.92.1 I got this twice during the same test. If the region servers are slow enough and you run an online alter, it's possible for the RS to change the znode status to CLOSED and have the master send an OPEN before the region server is able to remove the region from it's list of RITs. This is what the master sees: {quote} 011-12-21 22:24:09,498 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. (offlining) 2011-12-21 22:24:09,498 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db033f7 Creating unassigned node for 43123e2e3fc83ec25fe2a76b4f09077f in a CLOSING state 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r25s44,62023,1324494325099 for region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r25s44,62023,1324494325099, region=43123e2e3fc83ec25fe2a76b4f09077f 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 43123e2e3fc83ec25fe2a76b4f09077f 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. state=CLOSED, ts=1324506255629, server=sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db033f7 Creating (or updating) unassigned node for 43123e2e3fc83ec25fe2a76b4f09077f with OFFLINE state 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. destination server is + sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f.; plan=hri=test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., src=, dest=sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. to sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,664 ERROR org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment in: sv4r25s44,62023,1324494325099 due to org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:OPEN for the region:test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. ,which we are already trying to CLOSE. {quote} After that the master abandons. And the region server: {quote} 2011-12-21 22:24:09,523 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received close region: test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,523 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f.: disabling compactions flushes 2011-12-21 22:24:09,524 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., current region memstore size 40.5m 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., commencing wait for mvcc, flushsize=42482936 2011-12-21 22:24:13,368 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://sv4r11s38:9100/hbase/test1/43123e2e3fc83ec25fe2a76b4f09077f/.tmp/87d6944c54c7417e9a34a9f9542bcb72 to
[jira] [Updated] (HBASE-4940) hadoop-metrics.properties can include configuration of the rest context for ganglia
[ https://issues.apache.org/jira/browse/HBASE-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mubarak Seyed updated HBASE-4940: - Attachment: HBASE-4940.trunk.v1.patch The attached file (HBASE-4940.trunk.v1.patch) is a patch for TRUNK. Thanks. hadoop-metrics.properties can include configuration of the rest context for ganglia - Key: HBASE-4940 URL: https://issues.apache.org/jira/browse/HBASE-4940 Project: HBase Issue Type: Improvement Components: metrics Affects Versions: 0.90.5 Environment: HBase-0.90.1 Reporter: Mubarak Seyed Priority: Minor Labels: hbase-rest Fix For: 0.94.0 Attachments: HBASE-4940.patch, HBASE-4940.trunk.v1.patch It appears from hadoop-metrics.properties that configuration for rest context is missing. It would be good if we add the rest context and commented out them, if anyone is using rest-server and if they want to monitor using ganglia context then they can uncomment the rest context and use them for rest-server monitoring using ganglia. {code} # Configuration of the rest context for ganglia #rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext #rest.period=10 #rest.servers=ganglia-metad-hostname:port {code} Working on the patch, will submit it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174532#comment-13174532 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- bq. On 2011-12-21 17:05:47, Michael Stack wrote: bq. Looks good to me. How I know it works? Would it be hard to write a test? bq. bq. Jimmy Xiang wrote: bq. This is a race issue hard to reproduce. Let me think about how to come up a unit test. Jimmy. Please paste your last patch to JIRA. I'd like to include in our 0.92RC though it doesn't have a test. - Michael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 --- On 2011-12-22 00:31:23, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-22 00:31:23) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 bq.src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174533#comment-13174533 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- bq. On 2011-12-21 17:05:47, Michael Stack wrote: bq. Looks good to me. How I know it works? Would it be hard to write a test? bq. bq. Jimmy Xiang wrote: bq. This is a race issue hard to reproduce. Let me think about how to come up a unit test. bq. bq. Michael Stack wrote: bq. Jimmy. Please paste your last patch to JIRA. I'd like to include in our 0.92RC though it doesn't have a test. Yes, it is in Jira: hbase-5081-patch-v7.txt Thanks. - Jimmy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 --- On 2011-12-22 00:31:23, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-22 00:31:23) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 bq.src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174534#comment-13174534 ] Hadoop QA commented on HBASE-5081: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508328/hbase-5081-patch-v7.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/575//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/575//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/575//console This message is automatically generated. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-5087) Up the 0.92RC zk to 3.4.1RC0
Up the 0.92RC zk to 3.4.1RC0 Key: HBASE-5087 URL: https://issues.apache.org/jira/browse/HBASE-5087 Project: HBase Issue Type: Task Reporter: stack Fix For: 0.92.0 ZK just found bad bug in 3.4.1 zookeeper-1333. They put up a fix and new rc, 3.4.1 (Andrew, you saw Todds query asking if it'd possible to hold to zk 3.3.4 and just have 3.4.1 for secure installs?) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races againsth splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174539#comment-13174539 ] jirapos...@reviews.apache.org commented on HBASE-5081: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4064 --- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9163 Did the first approach not work? src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java https://reviews.apache.org/r/3292/#comment9164 It seems we can do this unconditionally now (no need for the safeToDeleteNodeAsync flag). The worst scenario is trying to remove a node that has already been removed - Lars On 2011-12-22 00:31:23, Jimmy Xiang wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3292/ bq. --- bq. bq. (Updated 2011-12-22 00:31:23) bq. bq. bq. Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. bq. bq. bq. Summary bq. --- bq. bq. In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. bq. bq. It used to delete the node always. bq. bq. bq. This addresses bug HBASE-5081. bq. https://issues.apache.org/jira/browse/HBASE-5081 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 bq.src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 bq. bq. Diff: https://reviews.apache.org/r/3292/diff bq. bq. bq. Testing bq. --- bq. bq. mvn -Dtest=TestDistributedLogSplitting clean test bq. bq. bq. Thanks, bq. bq. Jimmy bq. bq. Distributed log splitting deleteNode races againsth splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot. I looked into it and here is what happened I think: 1. One rs died, the servershutdownhandler found it out and started the distributed log splitting; 2. All three tasks failed, so the three tasks were deleted, asynchronously; 3. Servershutdownhandler retried the log splitting; 4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks); 5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one task in the hashmap; 6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not in the hashmap, so it created a new orphan task. 7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short, so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen. So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have a clean start. Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen soon enough, some node created during the retrial could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174541#comment-13174541 ] Phabricator commented on HBASE-5033: Kannan has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. Liyin-- This looks very good. Some minor comments inlined. Also: Initially, I was not sure if this parallelization is worth doing for close. But as you pointed out-- with EvictOnClose turned on, this parallelization might be useful for close scenario too. On a separate note, with EvictOnClose, it seems like there are two cases related to closing a region. a) cluster shutdown related region closes b) load balancing related region closes. For the cluster shutdown case, even if EvictOnClose is true, I think we should by-pass the evictions to avoid the cost of cleaning up the blocks individually from the block cache since the server is shutting down anyway. For the load balancing case or normal compaction related file deletion case, it makes sense to keep do the evictions on close. I don't think the code currently distinguishes the two cases... (but I haven't looked carefully to double check). INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:529-541 might be good to make this block a helper function. This block of code is repeated 4 times with minor differences in params... src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:528 since 30 is repeated a few times, would be better to make a const for this too. And I would set the default much lower... to say 8 or something. Not sure how many disks is typical. src/main/java/org/apache/hadoop/hbase/regionserver/Store.java:294 Openner - Opener src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:527 plural form here (maxStoreOpenerThread - maxStoreOpenerThreads) and similarly in the other places. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch, D933.4.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HBASE-5087) Up the 0.92RC zk to 3.4.1RC0
[ https://issues.apache.org/jira/browse/HBASE-5087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-5087. -- Resolution: Fixed Assignee: stack Hadoop Flags: Incompatible change Committed as incompatible change to 0.92 and trunk (though 3.4.1 runs against 3.3.x... marking it so to give people pause) Up the 0.92RC zk to 3.4.1RC0 Key: HBASE-5087 URL: https://issues.apache.org/jira/browse/HBASE-5087 Project: HBase Issue Type: Task Reporter: stack Assignee: stack Fix For: 0.92.0 Attachments: 5087.txt ZK just found bad bug in 3.4.1 zookeeper-1333. They put up a fix and new rc, 3.4.1 (Andrew, you saw Todds query asking if it'd possible to hold to zk 3.3.4 and just have 3.4.1 for secure installs?) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira