[jira] [Commented] (HBASE-5611) Replayed edits from regions that failed to open during recovery aren't removed from the global MemStore size
[ https://issues.apache.org/jira/browse/HBASE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263904#comment-13263904 ] Ted Yu commented on HBASE-5611: --- I ran the 0.92 test suite against the 0.92 patch and didn't find any regressions. Will integrate tomorrow if there is no objection. Replayed edits from regions that failed to open during recovery aren't removed from the global MemStore size Key: HBASE-5611 URL: https://issues.apache.org/jira/browse/HBASE-5611 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: Jean-Daniel Cryans Assignee: Jieshan Bean Priority: Critical Fix For: 0.90.7, 0.92.2, 0.96.0, 0.94.1 Attachments: HBASE-5611-92.patch, HBASE-5611-94-minorchange.patch, HBASE-5611-trunk-v2-minorchange.patch This bug is rather easy to hit if the {{TimeoutMonitor}} is on; otherwise I think it's still possible to hit it if a region fails to open for more obscure reasons like HDFS errors. Consider a region that just went through distributed splitting and that's now being opened by a new RS. The first thing it does is read the recovery files and put the edits in the {{MemStores}}. If this process takes a long time, the master will move that region away. At that point the edits are still accounted for in the global {{MemStore}} size but they are dropped when the {{HRegion}} gets cleaned up. 
It's completely invisible until the {{MemStoreFlusher}} needs to force flush a region and none of them have edits:
{noformat}
2012-03-21 00:33:39,303 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=5.9g
2012-03-21 00:33:39,303 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed for entry null
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:199)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:223)
        at java.lang.Thread.run(Thread.java:662)
{noformat}
The {{null}} here is a region. In my case I had so many edits in the {{MemStore}} during recovery that I'm over the low barrier although in fact I'm at 0. It happened yesterday and it is still printing this out. To fix this we need to be able to decrease the global {{MemStore}} size when the region can't open. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
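The fix direction described in HBASE-5611 (decrease the global MemStore size when a region cannot open) can be sketched roughly as follows. This is only an illustration and not the attached patch: GlobalMemStoreAccounting and ReplayingRegion are hypothetical stand-ins, not HBase classes. The idea is that a region remembers how many bytes it added to the global counter while replaying recovered edits, and gives them back if the open is abandoned.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the region server's global MemStore size counter.
class GlobalMemStoreAccounting {
    private final AtomicLong globalSize = new AtomicLong();

    long add(long delta) {
        return globalSize.addAndGet(delta);
    }

    long get() {
        return globalSize.get();
    }
}

// Hypothetical region that tracks its own contribution during replay.
class ReplayingRegion {
    private final GlobalMemStoreAccounting accounting;
    private long replayedBytes = 0;

    ReplayingRegion(GlobalMemStoreAccounting accounting) {
        this.accounting = accounting;
    }

    // Each replayed edit is counted both locally and globally.
    void replayEdit(long editSize) {
        replayedBytes += editSize;
        accounting.add(editSize);
    }

    // Called when the open fails or the region is moved away:
    // subtract exactly what this region added, then reset.
    void rollbackReplayedEdits() {
        accounting.add(-replayedBytes);
        replayedBytes = 0;
    }
}
```

Resetting the local count makes the rollback safe to call more than once, which matters if several cleanup paths can run for the same failed open.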
[jira] [Commented] (HBASE-5890) SplitLog Rescan BusyWaits upon Zk.CONNECTIONLOSS
[ https://issues.apache.org/jira/browse/HBASE-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263930#comment-13263930 ] Ted Yu commented on HBASE-5890: --- Patch looks good. In the catch clause, the following statement should be added: {code} Thread.currentThread().interrupt(); {code} SplitLog Rescan BusyWaits upon Zk.CONNECTIONLOSS Key: HBASE-5890 URL: https://issues.apache.org/jira/browse/HBASE-5890 Project: HBase Issue Type: Bug Reporter: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0, 0.96.0, 0.89-fb Attachments: HBASE-5890.patch We ran into a production issue yesterday where the SplitLogManager tried to create a Rescan node in ZK. The createAsync() generated a KeeperException.CONNECTIONLOSS that was immediately sent to processResult(), createRescan node with --retry_count was called, and this created a CPU busywait that also clogged up the logs. We should handle this better.
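For readers unfamiliar with the idiom in the review comment above, here is a minimal sketch. It assumes the catch clause in question handles InterruptedException around a pause between retries; the patch itself is not shown in this message, so RetryPause and pauseBeforeRetry are hypothetical names. Catching InterruptedException clears the thread's interrupt flag, so the handler sets it again for callers further up the stack.

```java
// Hypothetical helper: pause between retries without swallowing interruption.
class RetryPause {
    // Returns true if the full pause elapsed, false if the thread was interrupted.
    static boolean pauseBeforeRetry(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // sleep() cleared the interrupt flag when it threw; restore it
            // so code higher up can still observe the interruption.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A retry loop built on this can stop as soon as the method returns false instead of spinning, which is the kind of busy wait the issue describes.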
[jira] [Updated] (HBASE-5826) Improve sync of HLog edits
[ https://issues.apache.org/jira/browse/HBASE-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-5826: -- Attachment: (was: 5826.txt) Improve sync of HLog edits -- Key: HBASE-5826 URL: https://issues.apache.org/jira/browse/HBASE-5826 Project: HBase Issue Type: Improvement Reporter: Zhihong Yu Attachments: 5826.txt HBASE-5782 solved the correctness issue for the sync of HLog edits. Todd provided a patch that would achieve higher throughput. This JIRA is a continuation of Todd's work submitted there.
[jira] [Updated] (HBASE-5826) Improve sync of HLog edits
[ https://issues.apache.org/jira/browse/HBASE-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-5826: -- Attachment: 5826.txt Improve sync of HLog edits -- Key: HBASE-5826 URL: https://issues.apache.org/jira/browse/HBASE-5826 Project: HBase Issue Type: Improvement Reporter: Zhihong Yu Attachments: 5826.txt HBASE-5782 solved the correctness issue for the sync of HLog edits. Todd provided a patch that would achieve higher throughput. This JIRA is a continuation of Todd's work submitted there.
[jira] [Commented] (HBASE-5712) Parallelize load of .regioninfo files in diagnostic/repair portion of hbck.
[ https://issues.apache.org/jira/browse/HBASE-5712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264048#comment-13264048 ] Ted Yu commented on HBASE-5712: --- The new sentence is better. +1. Parallelize load of .regioninfo files in diagnostic/repair portion of hbck. --- Key: HBASE-5712 URL: https://issues.apache.org/jira/browse/HBASE-5712 Project: HBase Issue Type: Sub-task Components: hbck Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Attachments: hbase-5712-90.patch, hbase-5712.patch On heavily loaded HDFS clusters, some DFS nodes may not respond quickly, and the client backs off for 60s before attempting to read data from another datanode. Portions of the information gathered from HDFS (.regioninfo files) are loaded serially. On HBase clusters with 100's, or 1000's, or 10000's of regions, encountering these 60s delays blocks progress and can be very painful. There is already some parallelization of portions of the HDFS information load operations, and the goal here is to move the reading of .regioninfos into the parallelized sections.
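To illustrate why moving the .regioninfo reads into a parallel section helps, here is a generic sketch using a fixed thread pool. It is not the hbck code; ParallelLoader and loadAll are hypothetical names, and each Callable stands in for reading one region's .regioninfo. With serial loading, every slow datanode adds its full delay to the total; with a pool, slow reads overlap with the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: run independent load tasks concurrently.
class ParallelLoader {
    // Runs every loader on a fixed-size pool and returns results in submission order.
    static <T> List<T> loadAll(List<Callable<T>> loaders, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all tasks have completed.
            for (Future<T> f : pool.invokeAll(loaders)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a loader failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size is a tuning choice: too small and slow reads still queue behind each other, too large and the already loaded cluster sees more concurrent requests.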
[jira] [Updated] (HBASE-5885) Invalid HFile block magic on Local file System
[ https://issues.apache.org/jira/browse/HBASE-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-5885: -- Attachment: 5885-trunk-v2.txt Wrong patch was attached as trunk-1. Re-attaching patch for trunk. Invalid HFile block magic on Local file System -- Key: HBASE-5885 URL: https://issues.apache.org/jira/browse/HBASE-5885 Project: HBase Issue Type: Bug Affects Versions: 0.94.0, 0.96.0 Reporter: Elliott Clark Assignee: Elliott Clark Priority: Blocker Fix For: 0.94.0 Attachments: 5885-trunk-v2.txt, HBASE-5885-94-0.patch, HBASE-5885-94-1.patch, HBASE-5885-trunk-0.patch, HBASE-5885-trunk-1.patch
ERROR: java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=7, exceptions:
Thu Apr 26 11:19:18 PDT 2012, org.apache.hadoop.hbase.client.ScannerCallable@190a621a, java.io.IOException: java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for reader reader=file:/tmp/hbase-eclark/hbase/TestTable/e2d1c846363c75262cbfd85ea278b342/info/bae2681d63734066957b58fe791a0268, compression=none, cacheConf=CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false], firstKey=01/info:data/1335463981520/Put, lastKey=0002588100/info:data/1335463902296/Put, avgKeyLen=30, avgValueLen=1000, entries=1215085, length=1264354417, cur=000248/info:data/1335463994457/Put/vlen=1000/ts=0]
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:135)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:95)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:368)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3323)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3279)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3296)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2393)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1376)
Caused by: java.io.IOException: Invalid HFile block magic: \xEC\xD5\x9D\xB4\xC2bfo
        at org.apache.hadoop.hbase.io.hfile.BlockType.parse(BlockType.java:153)
        at org.apache.hadoop.hbase.io.hfile.BlockType.read(BlockType.java:164)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.init(HFileBlock.java:254)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1779)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1637)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:327)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.readNextDataBlock(HFileReaderV2.java:555)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:651)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:130)
        ... 12 more
Thu Apr 26 11:19:19 PDT 2012, org.apache.hadoop.hbase.client.ScannerCallable@190a621a, java.io.IOException: java.io.IOException: java.lang.IllegalArgumentException
        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:1132)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:1121)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2420)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1376)
Caused by: java.lang.IllegalArgumentException
        at java.nio.Buffer.position(Buffer.java:216)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:630)
[jira] [Updated] (HBASE-5885) Invalid HFile block magic on Local file System
[ https://issues.apache.org/jira/browse/HBASE-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-5885: -- Fix Version/s: 0.96.0 Hadoop Flags: Reviewed Invalid HFile block magic on Local file System -- Key: HBASE-5885 URL: https://issues.apache.org/jira/browse/HBASE-5885 Project: HBase Issue Type: Bug Affects Versions: 0.94.0, 0.96.0 Reporter: Elliott Clark Assignee: Elliott Clark Priority: Blocker Fix For: 0.94.0, 0.96.0 Attachments: 5885-trunk-v2.txt, HBASE-5885-94-0.patch, HBASE-5885-94-1.patch, HBASE-5885-trunk-0.patch, HBASE-5885-trunk-1.patch
[jira] [Commented] (HBASE-5890) SplitLog Rescan BusyWaits upon Zk.CONNECTIONLOSS
[ https://issues.apache.org/jira/browse/HBASE-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264205#comment-13264205 ] Ted Yu commented on HBASE-5890: --- Prakash made a good point. {code} this.timeoutMonitor = new TimeoutMonitor( conf.getInt("hbase.splitlog.manager.timeoutmonitor.period", 1000), {code} TimeoutMonitor runs at an interval longer than the socket timeout. If the default 1 sec interval (for TimeoutMonitor) is acceptable for the delay, this approach would work. SplitLog Rescan BusyWaits upon Zk.CONNECTIONLOSS Key: HBASE-5890 URL: https://issues.apache.org/jira/browse/HBASE-5890 Project: HBase Issue Type: Bug Reporter: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0, 0.96.0, 0.89-fb Attachments: HBASE-5890.patch We ran into a production issue yesterday where the SplitLogManager tried to create a Rescan node in ZK. The createAsync() generated a KeeperException.CONNECTIONLOSS that was immediately sent to processResult(), createRescan node with --retry_count was called, and this created a CPU busywait that also clogged up the logs. We should handle this better.
[jira] [Commented] (HBASE-5611) Replayed edits from regions that failed to open during recovery aren't removed from the global MemStore size
[ https://issues.apache.org/jira/browse/HBASE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264210#comment-13264210 ] Ted Yu commented on HBASE-5611: --- Integrated to 0.92, 0.94 and trunk. Thanks for the patch, Jieshan. Thanks for the review, Lars. Replayed edits from regions that failed to open during recovery aren't removed from the global MemStore size Key: HBASE-5611 URL: https://issues.apache.org/jira/browse/HBASE-5611 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: Jean-Daniel Cryans Assignee: Jieshan Bean Priority: Critical Fix For: 0.90.7, 0.92.2, 0.94.0, 0.96.0 Attachments: HBASE-5611-92.patch, HBASE-5611-94-minorchange.patch, HBASE-5611-trunk-v2-minorchange.patch This bug is rather easy to hit if the {{TimeoutMonitor}} is on; otherwise I think it's still possible to hit it if a region fails to open for more obscure reasons like HDFS errors. Consider a region that just went through distributed splitting and that's now being opened by a new RS. The first thing it does is read the recovery files and put the edits in the {{MemStores}}. If this process takes a long time, the master will move that region away. At that point the edits are still accounted for in the global {{MemStore}} size but they are dropped when the {{HRegion}} gets cleaned up. It's completely invisible until the {{MemStoreFlusher}} needs to force flush a region and none of them have edits:
{noformat}
2012-03-21 00:33:39,303 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=5.9g
2012-03-21 00:33:39,303 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed for entry null
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:199)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:223)
        at java.lang.Thread.run(Thread.java:662)
{noformat}
The {{null}} here is a region. In my case I had so many edits in the {{MemStore}} during recovery that I'm over the low barrier although in fact I'm at 0. It happened yesterday and it is still printing this out. To fix this we need to be able to decrease the global {{MemStore}} size when the region can't open.
[jira] [Commented] (HBASE-5876) TestImportExport has been failing against hadoop 0.23 profile
[ https://issues.apache.org/jira/browse/HBASE-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264212#comment-13264212 ] Ted Yu commented on HBASE-5876: ---
From https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK-on-Hadoop-23/156/testReport/org.apache.hadoop.hbase.mapreduce/TestImportTsv/testMROnTableWithCustomMapper/:
{code}
2012-04-27 00:47:47,745 ERROR [pool-1-thread-1] service.CompositeService(72): Error starting services org.apache.hadoop.mapred.MiniMRCluster
org.apache.hadoop.yarn.YarnException: java.io.IOException: ResourceManager failed to start. Final state is INITED
        at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.start(MiniYARNCluster.java:152)
        at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
        at org.apache.hadoop.mapred.MiniMRClientClusterFactory.create(MiniMRClientClusterFactory.java:67)
        at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:180)
        at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:170)
        at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:162)
        at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:154)
        at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:147)
{code}
From https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK-on-Hadoop-23/156/testReport/org.apache.hadoop.hbase.mapreduce/TestImportExport/testWithDeletes/:
{code}
java.io.FileNotFoundException: File does not exist: /home/hudson/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/0.23.2-SNAPSHOT/hadoop-mapreduce-client-core-0.23.2-SNAPSHOT.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:729)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:208)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:71)
        at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:246)
        at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:284)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:355)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1221)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1218)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1239)
        at org.apache.hadoop.hbase.mapreduce.TestImportExport.testWithDeletes(TestImportExport.java:181)
{code}
TestImportExport has been failing against hadoop 0.23 profile - Key: HBASE-5876 URL: https://issues.apache.org/jira/browse/HBASE-5876 Project: HBase Issue Type: Bug Reporter: Zhihong Yu Assignee: Jonathan Hsieh TestImportExport has been failing against hadoop 0.23 profile
[jira] [Updated] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5416: -- Attachment: 5416-Filtered_scans_v6.patch Rebased Max's latest patch on trunk. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: filters, performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Max Lapan Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if any) applied to decide whether the row is needed. But when a scan covers several CFs and the filter checks data from only a subset of them, data from the CFs the filter does not check is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO performed by a scan by loading only the values the filter actually checks. For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and is quite costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, even though only a small subset is needed. The attached patch adds one routine to the Filter interface to allow a filter to specify which CFs it needs for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
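The two-group approach described in HBASE-5416 can be pictured with a small self-contained sketch. This is not the attached patch and does not use HBase's Filter or HRegion APIs; TwoPhaseScan, loadEssential and loadJoined are hypothetical names, with plain maps standing in for column family data. For each row, only the family the filter needs is loaded first, and the expensive family is read only for rows the filter accepts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical model of a scan that defers loading the "joined" family.
class TwoPhaseScan {
    static List<Map<String, String>> scan(
            List<String> rowKeys,
            Function<String, Map<String, String>> loadEssential, // small family the filter checks
            Function<String, Map<String, String>> loadJoined,    // large family, loaded lazily
            Predicate<Map<String, String>> filter) {
        List<Map<String, String>> results = new ArrayList<>();
        for (String key : rowKeys) {
            Map<String, String> essential = loadEssential.apply(key);
            if (!filter.test(essential)) {
                continue; // rejected: the joined family is never read for this row
            }
            Map<String, String> row = new HashMap<>(essential);
            row.putAll(loadJoined.apply(key));
            results.add(row);
        }
        return results;
    }
}
```

In the flags/snap example from the description, loadEssential would read flags and loadJoined would read snap, so the IO saved is roughly proportional to the fraction of rows the filter rejects.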
[jira] [Updated] (HBASE-5974) Scanner retry behavior with RPC timeout on next() seems incorrect
[ https://issues.apache.org/jira/browse/HBASE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5974: -- Attachment: 5974_94-V4.patch Patch v4 makes a small change to JVMClusterUtil.java so that RegionServerWithScanTimeout can be made private. TestClientScannerRPCTimeout passes. Scanner retry behavior with RPC timeout on next() seems incorrect - Key: HBASE-5974 URL: https://issues.apache.org/jira/browse/HBASE-5974 Project: HBase Issue Type: Bug Components: client, regionserver Affects Versions: 0.90.7, 0.92.1, 0.94.0, 0.96.0 Reporter: Todd Lipcon Assignee: Anoop Sam John Priority: Critical Fix For: 0.94.1 Attachments: 5974_94-V4.patch, HBASE-5974_0.94.patch, HBASE-5974_94-V2.patch, HBASE-5974_94-V3.patch
I'm seeing the following behavior:
- set RPC timeout to a short value
- call next() for some batch of rows, big enough so the client times out before the result is returned
- the HConnectionManager stuff will retry the next() call to the same server. At this point, one of two things can happen:
1) the previous next() call will still be processing, in which case you get a LeaseException, because it was removed from the map during the processing, or
2) the next() call will succeed but skip the prior batch of rows.
[jira] [Updated] (HBASE-6138) HadoopQA not running findbugs [Trunk]
[ https://issues.apache.org/jira/browse/HBASE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6138: -- Attachment: 6138-addendum.txt Addendum provided by Jesse. HadoopQA not running findbugs [Trunk] - Key: HBASE-6138 URL: https://issues.apache.org/jira/browse/HBASE-6138 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.96.0 Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 0.96.0 Attachments: 6138-addendum.txt, 6138.txt HadoopQA reports: -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail. But I am not able to see any reports link. When I checked the console output for the build I can see
{code}
[INFO] --- findbugs-maven-plugin:2.4.0:findbugs (default-cli) @ hbase-common ---
[INFO] Fork Value is true
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] HBase . SUCCESS [1.890s]
[INFO] HBase - Common FAILURE [2.238s]
[INFO] HBase - Server SKIPPED
[INFO] HBase - Assembly .. SKIPPED
[INFO] HBase - Site .. SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 4.856s
[INFO] Finished at: Thu May 31 03:35:35 UTC 2012
[INFO] Final Memory: 23M/154M
[INFO]
[ERROR] Could not find resource '${parent.basedir}/dev-support/findbugs-exclude.xml'. - [Help 1]
[ERROR]
{code}
Because of this error Findbugs is not getting run!
[jira] [Updated] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6067: -- Fix Version/s: 0.92.3 0.94.1 HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x.
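One common way to cope with a method overload that exists only in newer library versions is to look it up reflectively and fall back to the older form. The sketch below shows that general technique; it is not taken from the attached patches, and DefaultsViaReflection, OldFs and NewFs are hypothetical stand-ins (a String replaces Hadoop's Path so the example compiles without Hadoop on the classpath).

```java
import java.lang.reflect.Method;

// Hypothetical helper: prefer target.name(arg) when available, else target.name().
class DefaultsViaReflection {
    static long callWithArgOrFallback(Object target, String name, Object arg) {
        try {
            try {
                Method withArg = target.getClass().getMethod(name, arg.getClass());
                return ((Number) withArg.invoke(target, arg)).longValue();
            } catch (NoSuchMethodException e) {
                // Older version: only the no-argument form exists.
                Method noArg = target.getClass().getMethod(name);
                return ((Number) noArg.invoke(target)).longValue();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot call " + name, e);
        }
    }
}

// Stand-in for an older filesystem API with only the no-argument method.
class OldFs {
    public long getDefaultBlockSize() {
        return 64L;
    }
}

// Stand-in for a newer API where the no-argument form is unusable
// but a path-taking overload exists.
class NewFs {
    public long getDefaultBlockSize() {
        throw new UnsupportedOperationException("no default filesystem");
    }

    public long getDefaultBlockSize(String path) {
        return 128L;
    }
}
```

The lookup matches on the exact runtime class of the argument, which is enough for this illustration; a real caller would pass the declared parameter type explicitly.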
[jira] [Updated] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-5416:
Attachment: (was: 5416-Filtered_scans_v6.patch)

Improve performance of scans with some kind of filters.
Key: HBASE-5416
URL: https://issues.apache.org/jira/browse/HBASE-5416
Project: HBase
Issue Type: Improvement
Components: filters, performance, regionserver
Affects Versions: 0.90.4
Reporter: Max Lapan
Assignee: Max Lapan
Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch

When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if one exists) applied to decide whether the row is needed. But when a scan covers several CFs and the filter checks data from only a subset of them, data from the CFs not checked by the filter is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO performed by a scan by loading only the values actually checked by the filter.

For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and is quite costly to scan. If we need only rows with some flag specified, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, when only a small subset is needed.

The attached patch adds one routine to the Filter interface to allow a filter to specify which CFs are needed for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up this kind of scan 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
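The two-phase loading described in HBASE-5416 can be pictured with a small stand-alone sketch. It is not the patch: the class, the maps standing in for the flags and snap families, and the `joinedReads` counter are all hypothetical names chosen for illustration. The point is only that the large family is read for a row after the filter has accepted that row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative stand-in; none of these names are HBase classes or methods.
class TwoPhaseScanSketch {
    // Number of rows for which the large family was actually read.
    int joinedReads = 0;

    // flags: row -> small value checked by the filter; snap: row -> large value.
    List<String> scan(Map<String, String> flags, Map<String, String> snap,
                      Predicate<String> filter) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, String> row : flags.entrySet()) {
            // Phase 1: read only the family the filter needs.
            if (!filter.test(row.getValue())) {
                continue; // rejected: the large family is never touched
            }
            // Phase 2: the filter accepted the row, so load the rest.
            joinedReads++;
            results.add(row.getKey() + ":" + snap.get(row.getKey()));
        }
        return results;
    }

    // Small demo: three rows, one of which carries the wanted flag.
    static int demoJoinedReads() {
        Map<String, String> flags = new LinkedHashMap<>();
        flags.put("row1", "skip");
        flags.put("row2", "keep");
        flags.put("row3", "skip");
        Map<String, String> snap = new LinkedHashMap<>();
        snap.put("row1", "big1");
        snap.put("row2", "big2");
        snap.put("row3", "big3");
        TwoPhaseScanSketch s = new TwoPhaseScanSketch();
        s.scan(flags, snap, "keep"::equals);
        return s.joinedReads;
    }
}
```

In the demo only one of the three rows passes the filter, so the large family is consulted once rather than three times. The real saving depends on how selective the filter is and on how much larger the joined families are.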
[jira] [Updated] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-5416:
Attachment: 5416-Filtered_scans_v6.patch

Improve performance of scans with some kind of filters.
Key: HBASE-5416
URL: https://issues.apache.org/jira/browse/HBASE-5416
Project: HBase
Issue Type: Improvement
Components: filters, performance, regionserver
Affects Versions: 0.90.4
Reporter: Max Lapan
Assignee: Max Lapan
Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch

When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if one exists) applied to decide whether the row is needed. But when a scan covers several CFs and the filter checks data from only a subset of them, data from the CFs not checked by the filter is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO performed by a scan by loading only the values actually checked by the filter.

For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and is quite costly to scan. If we need only rows with some flag specified, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, when only a small subset is needed.

The attached patch adds one routine to the Filter interface to allow a filter to specify which CFs are needed for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up this kind of scan 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
[jira] [Commented] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288218#comment-13288218 ] Zhihong Ted Yu commented on HBASE-6067: --- Integrated to trunk first. Thanks for the review, Eli. HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288248#comment-13288248 ] Zhihong Ted Yu commented on HBASE-6067: --- Integrated to 0.94 Will wait for release of 0.92.2 before integrating to 0.92 branch. HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-6138) HadoopQA not running findbugs [Trunk]
[ https://issues.apache.org/jira/browse/HBASE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu reassigned HBASE-6138:
Assignee: Jesse Yates (was: Anoop Sam John)

HadoopQA not running findbugs [Trunk]
Key: HBASE-6138
URL: https://issues.apache.org/jira/browse/HBASE-6138
Project: HBase
Issue Type: Bug
Components: build
Affects Versions: 0.96.0
Reporter: Anoop Sam John
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: 6138-addendum.txt, 6138.txt

HadoopQA reports: -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail. But I am not able to see any report link. When I checked the console output for the build, I saw:
{code}
[INFO] --- findbugs-maven-plugin:2.4.0:findbugs (default-cli) @ hbase-common ---
[INFO] Fork Value is true
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] HBase . SUCCESS [1.890s]
[INFO] HBase - Common FAILURE [2.238s]
[INFO] HBase - Server SKIPPED
[INFO] HBase - Assembly .. SKIPPED
[INFO] HBase - Site .. SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 4.856s
[INFO] Finished at: Thu May 31 03:35:35 UTC 2012
[INFO] Final Memory: 23M/154M
[INFO]
[ERROR] Could not find resource '${parent.basedir}/dev-support/findbugs-exclude.xml'. -> [Help 1]
[ERROR]
{code}
Because of this error, Findbugs is not getting run!
[jira] [Resolved] (HBASE-6138) HadoopQA not running findbugs [Trunk]
[ https://issues.apache.org/jira/browse/HBASE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu resolved HBASE-6138.
Resolution: Fixed
Hadoop Flags: Reviewed

From QA report of HBASE-5416:
{code}
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2095//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2095//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
{code}

HadoopQA not running findbugs [Trunk]
Key: HBASE-6138
URL: https://issues.apache.org/jira/browse/HBASE-6138
Project: HBase
Issue Type: Bug
Components: build
Affects Versions: 0.96.0
Reporter: Anoop Sam John
Assignee: Jesse Yates
Fix For: 0.96.0
Attachments: 6138-addendum.txt, 6138.txt

HadoopQA reports: -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail. But I am not able to see any report link. When I checked the console output for the build, I saw:
{code}
[INFO] --- findbugs-maven-plugin:2.4.0:findbugs (default-cli) @ hbase-common ---
[INFO] Fork Value is true
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] HBase . SUCCESS [1.890s]
[INFO] HBase - Common FAILURE [2.238s]
[INFO] HBase - Server SKIPPED
[INFO] HBase - Assembly .. SKIPPED
[INFO] HBase - Site .. SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 4.856s
[INFO] Finished at: Thu May 31 03:35:35 UTC 2012
[INFO] Final Memory: 23M/154M
[INFO]
[ERROR] Could not find resource '${parent.basedir}/dev-support/findbugs-exclude.xml'. -> [Help 1]
[ERROR]
{code}
Because of this error, Findbugs is not getting run!
[jira] [Commented] (HBASE-5974) Scanner retry behavior with RPC timeout on next() seems incorrect
[ https://issues.apache.org/jira/browse/HBASE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288318#comment-13288318 ] Zhihong Ted Yu commented on HBASE-5974: --- w.r.t. potential change to RegionScanner, if users create wrapper(s), the maintenance of seqNo would still be completed by core implementation. See the following: {code} @InterfaceAudience.Private public interface RegionScanner extends InternalScanner { {code} Scanner retry behavior with RPC timeout on next() seems incorrect - Key: HBASE-5974 URL: https://issues.apache.org/jira/browse/HBASE-5974 Project: HBase Issue Type: Bug Components: client, regionserver Affects Versions: 0.90.7, 0.92.1, 0.94.0, 0.96.0 Reporter: Todd Lipcon Assignee: Anoop Sam John Priority: Critical Fix For: 0.94.1 Attachments: 5974_94-V4.patch, HBASE-5974_0.94.patch, HBASE-5974_94-V2.patch, HBASE-5974_94-V3.patch I'm seeing the following behavior: - set RPC timeout to a short value - call next() for some batch of rows, big enough so the client times out before the result is returned - the HConnectionManager stuff will retry the next() call to the same server. At this point, one of two things can happen: 1) the previous next() call will still be processing, in which case you get a LeaseException, because it was removed from the map during the processing, or 2) the next() call will succeed but skip the prior batch of rows. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (HBASE-5974) Scanner retry behavior with RPC timeout on next() seems incorrect
[ https://issues.apache.org/jira/browse/HBASE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288318#comment-13288318 ]
Zhihong Ted Yu edited comment on HBASE-5974 at 6/4/12 4:18 AM:

w.r.t. potential change to RegionScanner, if users create wrapper(s), the maintenance of seqNo would still be completed by core implementation. See the following in patch:
{code}
+  public Result[] next(final long scannerId, int nbRows) throws IOException {
+    return next(scannerId, nbRows, -1);
+  }
{code}
and the following in current code base:
{code}
@InterfaceAudience.Private
public interface RegionScanner extends InternalScanner {
{code}

was (Author: zhi...@ebaysf.com):
w.r.t. potential change to RegionScanner, if users create wrapper(s), the maintenance of seqNo would still be completed by core implementation. See the following:
{code}
@InterfaceAudience.Private
public interface RegionScanner extends InternalScanner {
{code}

Scanner retry behavior with RPC timeout on next() seems incorrect
Key: HBASE-5974
URL: https://issues.apache.org/jira/browse/HBASE-5974
Project: HBase
Issue Type: Bug
Components: client, regionserver
Affects Versions: 0.90.7, 0.92.1, 0.94.0, 0.96.0
Reporter: Todd Lipcon
Assignee: Anoop Sam John
Priority: Critical
Fix For: 0.94.1
Attachments: 5974_94-V4.patch, HBASE-5974_0.94.patch, HBASE-5974_94-V2.patch, HBASE-5974_94-V3.patch

I'm seeing the following behavior:
- set RPC timeout to a short value
- call next() for some batch of rows, big enough so the client times out before the result is returned
- the HConnectionManager stuff will retry the next() call to the same server.

At this point, one of two things can happen:
1) the previous next() call will still be processing, in which case you get a LeaseException, because it was removed from the map during the processing, or
2) the next() call will succeed but skip the prior batch of rows.
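The seqNo mentioned in the comment suggests a per-scanner call sequence that the server checks on every next(). The sketch below shows, under that assumption, how such a check could turn the silent row skip from case 2) into an explicit error. It is a stand-alone illustration, not the patch: the class name, the `expectedSeq` field and the choice of `IllegalStateException` are all hypothetical.

```java
import java.util.Arrays;

// Illustrative stand-in; this is not the RegionScanner or HRegionServer code.
class SeqCheckedScannerSketch {
    private final int[] rows;
    private int pos = 0;
    // Sequence number the server expects on the next call (hypothetical field).
    private long expectedSeq = 0;

    SeqCheckedScannerSketch(int[] rows) {
        this.rows = rows;
    }

    // Returns the next batch, but only if the caller's sequence number matches.
    int[] next(int nbRows, long callSeq) {
        if (callSeq != expectedSeq) {
            // A retried call carries a stale number: fail instead of skipping rows.
            throw new IllegalStateException(
                "out-of-order next(): expected " + expectedSeq + ", got " + callSeq);
        }
        expectedSeq++;
        int end = Math.min(pos + nbRows, rows.length);
        int[] batch = Arrays.copyOfRange(rows, pos, end);
        pos = end;
        return batch;
    }

    // Simulates a client that times out and retries with the same number.
    static boolean demoRetryRejected() {
        SeqCheckedScannerSketch s = new SeqCheckedScannerSketch(new int[] {1, 2, 3, 4});
        s.next(2, 0); // served, but suppose the client never saw the reply
        try {
            s.next(2, 0); // the retry reuses sequence number 0
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

With the check in place the retried call fails loudly, and the client can decide whether to reopen the scanner from the last row it actually received.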
[jira] [Commented] (HBASE-5699) Run with 1 WAL in HRegionServer
[ https://issues.apache.org/jira/browse/HBASE-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288592#comment-13288592 ] Zhihong Ted Yu commented on HBASE-5699: --- Can you run ycsb with 50% insert and 50% update load ? Performance numbers in attachment match what I got based on my implementation. Thanks Run with 1 WAL in HRegionServer - Key: HBASE-5699 URL: https://issues.apache.org/jira/browse/HBASE-5699 Project: HBase Issue Type: Improvement Reporter: binlijin Assignee: Li Pi Attachments: PerfHbase.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288606#comment-13288606 ] Zhihong Ted Yu commented on HBASE-6067: --- @Daryn: Looking at: http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/Class.html#getDeclaredMethod%28java.lang.String,%20java.lang.Class[]%29 I don't see why getDeclaredMethod wouldn't find the desired method from this.fs w.r.t. setAccessible(), I didn't include it in patch v1 for this reason. Then I found that getGetNumCurrentReplicas() was using the call. And: {code} public int getNumCurrentReplicas() throws IOException { ./src/hdfs/org/apache/hadoop/hdfs/DFSClient.java {code} HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
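On the getDeclaredMethod question: in plain Java, `Class.getDeclaredMethod` only sees methods declared by that exact class, while `Class.getMethod` also finds public methods inherited from superclasses. The sketch below demonstrates this with stand-in classes and then shows a reflective lookup that prefers a path-taking overload and falls back to the no-argument one. `BaseFs`, `ChildFs` and `LegacyFs` are hypothetical and are not Hadoop classes; whether this distinction is what the thread's discussion turned on is an assumption.

```java
import java.lang.reflect.Method;

// Stand-ins only: these are not Hadoop's FileSystem classes.
class BaseFs {
    public long getDefaultBlockSize() { return 64L; }
    // Stand-in for an overload that takes a path.
    public long getDefaultBlockSize(String path) { return 128L; }
}

class ChildFs extends BaseFs { }

// Stand-in for an older API that lacks the path-taking overload.
class LegacyFs {
    public long getDefaultBlockSize() { return 32L; }
}

class ReflectiveLookupSketch {
    // Prefer the path-taking overload; fall back to the no-argument one.
    static long blockSize(Object fs, String path) {
        try {
            try {
                Method m = fs.getClass().getMethod("getDefaultBlockSize", String.class);
                return (Long) m.invoke(fs, path);
            } catch (NoSuchMethodException e) {
                Method m = fs.getClass().getMethod("getDefaultBlockSize");
                return (Long) m.invoke(fs);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // getDeclaredMethod ignores inherited methods, so this returns false.
    static boolean declaredLookupFindsInherited() {
        try {
            ChildFs.class.getDeclaredMethod("getDefaultBlockSize", String.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

Because the methods looked up here are public, `getMethod` is enough and no `setAccessible()` call is needed in this sketch.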
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288608#comment-13288608 ]
Zhihong Ted Yu commented on HBASE-6060:

Agreed. We can integrate once the patch for 0.92 is ready.

Regions's in OPENING state from failed regionservers takes a long time to recover
Key: HBASE-6060
URL: https://issues.apache.org/jira/browse/HBASE-6060
Project: HBase
Issue Type: Bug
Components: master, regionserver
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 0.96.0, 0.94.1, 0.92.3
Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, HBASE-6060-94.patch

We have seen a pattern in tests where regions are stuck in OPENING state for a very long time when the region server that is opening the region fails. My understanding of the process:
- The master calls the rs to open the region. If the rs is offline, a new plan is generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in master memory; zk still shows OFFLINE). See HRegionServer.openRegion(), HMaster.assign()
- The RegionServer starts opening a region and changes the state in the znode. But that znode is not ephemeral. (see ZkAssign)
- The rs transitions the zk node from OFFLINE to OPENING. See OpenRegionHandler.process()
- The rs then opens the region, and changes the znode from OPENING to OPENED
- When the rs is killed between the OPENING and OPENED states, zk shows OPENING state and the master just waits for the rs to change the region state, but since the rs is down, that won't happen.
- There is an AssignmentManager.TimeoutMonitor, which guards against exactly these kinds of conditions. It periodically checks (every 10 sec by default) the regions in transition to see whether they have timed out (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is 30 min, which explains what you and I are seeing.
- ServerShutdownHandler in Master does not reassign regions in OPENING state, although it handles other states.

Lowering that threshold from the configuration is one option, but still I think we can do better. Will investigate more.
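The gap in the last bullet can be pictured with a toy model: when a server dies, the regions it left in OPENING are exactly the ones that would otherwise sit until the timeout fires. The sketch below selects them from an in-memory map of regions in transition. It is only an illustration under stated assumptions; `Transition`, `regionsToReassign` and the map layout are hypothetical and do not reflect the AssignmentManager or the actual patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model; not the HBase AssignmentManager or ServerShutdownHandler.
class OpeningRecoverySketch {
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPENED }

    // One region in transition: its current state and the server working on it.
    static final class Transition {
        final State state;
        final String server;

        Transition(State state, String server) {
            this.state = state;
            this.server = server;
        }
    }

    // Regions a shutdown handler would reassign at once when deadServer fails,
    // instead of leaving them for the timeout monitor.
    static List<String> regionsToReassign(Map<String, Transition> inTransition,
                                          String deadServer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Transition> e : inTransition.entrySet()) {
            Transition t = e.getValue();
            if (t.state == State.OPENING && deadServer.equals(t.server)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    static List<String> demo() {
        Map<String, Transition> rit = new LinkedHashMap<>();
        rit.put("r1", new Transition(State.OPENING, "rs1"));
        rit.put("r2", new Transition(State.OPENING, "rs2"));
        rit.put("r3", new Transition(State.OPENED, "rs1"));
        return regionsToReassign(rit, "rs1");
    }
}
```

In the demo only r1 qualifies: r2 is being opened by a live server and r3 has already reached OPENED.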
[jira] [Commented] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288636#comment-13288636 ] Zhihong Ted Yu commented on HBASE-6067: --- @Stack: Do you think this JIRA should be in 0.92.2 RC ? HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5974) Scanner retry behavior with RPC timeout on next() seems incorrect
[ https://issues.apache.org/jira/browse/HBASE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288644#comment-13288644 ] Zhihong Ted Yu commented on HBASE-5974: --- I would listen to Todd and Andy's opinion. Scanner retry behavior with RPC timeout on next() seems incorrect - Key: HBASE-5974 URL: https://issues.apache.org/jira/browse/HBASE-5974 Project: HBase Issue Type: Bug Components: client, regionserver Affects Versions: 0.90.7, 0.92.1, 0.94.0, 0.96.0 Reporter: Todd Lipcon Assignee: Anoop Sam John Priority: Critical Fix For: 0.96.0, 0.94.1 Attachments: 5974_94-V4.patch, 5974_trunk.patch, HBASE-5974_0.94.patch, HBASE-5974_94-V2.patch, HBASE-5974_94-V3.patch I'm seeing the following behavior: - set RPC timeout to a short value - call next() for some batch of rows, big enough so the client times out before the result is returned - the HConnectionManager stuff will retry the next() call to the same server. At this point, one of two things can happen: 1) the previous next() call will still be processing, in which case you get a LeaseException, because it was removed from the map during the processing, or 2) the next() call will succeed but skip the prior batch of rows. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6055) Snapshots in HBase 0.96
[ https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288733#comment-13288733 ] Zhihong Ted Yu commented on HBASE-6055: --- bq. The HLog will have edits from regions not relevant to the table's regions. Over in HBASE-5699, each one of the multiple WALs can be devised to receive edits from single table. Snapshots in HBase 0.96 --- Key: HBASE-6055 URL: https://issues.apache.org/jira/browse/HBASE-6055 Project: HBase Issue Type: New Feature Components: client, master, regionserver, zookeeper Reporter: Jesse Yates Assignee: Jesse Yates Fix For: 0.96.0 Attachments: Snapshots in HBase.docx Continuation of HBASE-50 for the current trunk. Since the implementation has drastically changed, opening as a new ticket. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6067: -- Attachment: 6067-addendum.txt HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-addendum.txt, 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288740#comment-13288740 ] Zhihong Ted Yu commented on HBASE-6067: --- Addendum integrated to 0.94 and trunk. Thanks for the tip, Daryn. HBase won't start when hbase.rootdir uses ViewFileSystem Key: HBASE-6067 URL: https://issues.apache.org/jira/browse/HBASE-6067 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Eli Collins Assignee: Zhihong Ted Yu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6067-addendum.txt, 6067-v2.txt, 6067.txt HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6046) Master retry on ZK session expiry causes inconsistent region assignments.
[ https://issues.apache.org/jira/browse/HBASE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288756#comment-13288756 ]
Zhihong Ted Yu commented on HBASE-6046:

bq. New ServerManager is not created
I think there is a typo above: 'not' -> 'now'

Master retry on ZK session expiry causes inconsistent region assignments.
Key: HBASE-6046
URL: https://issues.apache.org/jira/browse/HBASE-6046
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.92.1, 0.94.0
Reporter: Gopinathan A
Assignee: ramkrishna.s.vasudevan
Attachments: HBASE_6046-trunk.patch, HBASE_6046_0.94.patch, HBASE_6046_0.94_1.patch, HBASE_6046_0.94_2.patch, HBASE_6046_0.94_3.patch

1. A ZK session timeout in the HMaster leads to bulk assignment even though all the RSs are online.
2. While doing bulk assignment, if the master goes down again and restarts (or a backup comes up), all the nodes created in ZK will be reassigned to the new RSs. This leads to double assignment.

We had 2800 regions; of these, 1900 regions got double assignment, taking the region count to 4700.
[jira] [Updated] (HBASE-6131) Add attribution for code added by HBASE-5533 metrics
[ https://issues.apache.org/jira/browse/HBASE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6131: -- Fix Version/s: 0.96.0 0.92.2 Add attribution for code added by HBASE-5533 metrics Key: HBASE-6131 URL: https://issues.apache.org/jira/browse/HBASE-6131 Project: HBase Issue Type: Task Reporter: stack Assignee: stack Priority: Blocker Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6131.txt, 6131_092.txt, 6131_094.txt See the comment over in https://issues.apache.org/jira/browse/HBASE-5533?focusedCommentId=13283920page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13283920 The metrics histogram code was copied w/o attribution. Fix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-6158) Data loss if the words 'merges' or 'splits' are used as Column Family name
[ https://issues.apache.org/jira/browse/HBASE-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu reassigned HBASE-6158:
Assignee: Aditya Kishore

Data loss if the words 'merges' or 'splits' are used as Column Family name
Key: HBASE-6158
URL: https://issues.apache.org/jira/browse/HBASE-6158
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 0.94.0
Reporter: Aditya Kishore
Assignee: Aditya Kishore
Attachments: HBASE-6158.patch

If a table is created with either 'merges' or 'splits' as one of its Column Family names, it can never be flushed to disk even though table creation (and data population) succeeds. The reason is that these two names are used as temporary directory names inside the region folder for merges and splits respectively, and hence conflict with the directories created for a CF with the same name.

A simple fix would be to use '.merges' and '.splits' as the working folders (patch attached). This will also be consistent with other work folder names. An alternate fix would be to declare these words (and other similar ones) as reserved words and throw an exception when they are used. However, I find the alternate approach unnecessarily restrictive.
[jira] [Commented] (HBASE-6158) Data loss if the words 'merges' or 'splits' are used as Column Family name
[ https://issues.apache.org/jira/browse/HBASE-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288769#comment-13288769 ]
Zhihong Ted Yu commented on HBASE-6158:

Can you generate a patch for trunk?
{code}
-  static final String MERGEDIR = "merges";
+  static final String MERGEDIR = ".merges";
{code}
The above constant is only used in HRegion. We can make it private, right?

Data loss if the words 'merges' or 'splits' are used as Column Family name
Key: HBASE-6158
URL: https://issues.apache.org/jira/browse/HBASE-6158
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 0.94.0
Reporter: Aditya Kishore
Assignee: Aditya Kishore
Fix For: 0.96.0, 0.94.1
Attachments: HBASE-6158.patch

If a table is created with either 'merges' or 'splits' as one of its Column Family names, it can never be flushed to disk even though table creation (and data population) succeeds. The reason is that these two names are used as temporary directory names inside the region folder for merges and splits respectively, and hence conflict with the directories created for a CF with the same name.

A simple fix would be to use '.merges' and '.splits' as the working folders (patch attached). This will also be consistent with other work folder names. An alternate fix would be to declare these words (and other similar ones) as reserved words and throw an exception when they are used. However, I find the alternate approach unnecessarily restrictive.
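The name clash behind HBASE-6158 is easy to reproduce in miniature: a column family directory and a working directory collide when they share a name inside the same region folder. The sketch below checks for that, and also encodes one assumption that is not stated in the thread, namely that family names are not allowed to begin with a dot, which is what would make dot-prefixed working folders safe. All names in the code are hypothetical.

```java
// Illustrative only; not HRegion code.
class WorkDirCollisionSketch {
    // True if a family directory would share its name with a working directory.
    static boolean collides(String family, String... workDirs) {
        for (String dir : workDirs) {
            if (dir.equals(family)) {
                return true;
            }
        }
        return false;
    }

    // Assumed rule for this sketch: a family name must not start with '.'.
    static boolean isLegalFamily(String family) {
        return !family.isEmpty() && family.charAt(0) != '.';
    }
}
```

With the old working folder names, `collides("merges", "merges", "splits")` is true. With the dot-prefixed names the same family no longer collides, and under the assumed naming rule no legal family name could ever collide with them.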
[jira] [Updated] (HBASE-6158) Data loss if the words 'merges' or 'splits' are used as Column Family name
[ https://issues.apache.org/jira/browse/HBASE-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Ted Yu updated HBASE-6158:
---
Fix Version/s: 0.94.1, 0.96.0

Data loss if the words 'merges' or 'splits' are used as Column Family name
---
Key: HBASE-6158
URL: https://issues.apache.org/jira/browse/HBASE-6158
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 0.94.0
Reporter: Aditya Kishore
Assignee: Aditya Kishore
Fix For: 0.96.0, 0.94.1
Attachments: HBASE-6158.patch
[jira] [Updated] (HBASE-6158) Data loss if the words 'merges' or 'splits' are used as Column Family name
[ https://issues.apache.org/jira/browse/HBASE-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Ted Yu updated HBASE-6158:
---
Hadoop Flags: Reviewed
Status: Patch Available (was: Open)

Data loss if the words 'merges' or 'splits' are used as Column Family name
---
Key: HBASE-6158
URL: https://issues.apache.org/jira/browse/HBASE-6158
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 0.94.0
Reporter: Aditya Kishore
Assignee: Aditya Kishore
Fix For: 0.96.0, 0.94.1
Attachments: HBASE-6158_94.patch, HBASE-6158_trunk.patch
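The collision described in HBASE-6158 can be sketched without any HBase code. The class below is purely illustrative: `RegionDirNames` and `collisions` are hypothetical names, and a region folder is modelled as nothing more than a list of child directory names. Only the four directory names come from the issue text. The dot-prefixed names avoid the clash only if column family names cannot themselves start with a period, which I believe HBase enforces but which is not stated in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: models a region folder as a flat list of child directory names. */
final class RegionDirNames {
    // Working-directory names before and after the proposed fix (from the issue text).
    static final List<String> OLD_WORK_DIRS = Arrays.asList("merges", "splits");
    static final List<String> NEW_WORK_DIRS = Arrays.asList(".merges", ".splits");

    /** Returns the column family names that would share a directory with a working folder. */
    static Set<String> collisions(List<String> families, List<String> workDirs) {
        Set<String> clash = new HashSet<>(families);
        clash.retainAll(workDirs); // keep only names present in both lists
        return clash;
    }

    public static void main(String[] args) {
        List<String> families = Arrays.asList("info", "splits");
        System.out.println(collisions(families, OLD_WORK_DIRS)); // [splits]
        System.out.println(collisions(families, NEW_WORK_DIRS)); // []
    }
}
```

With the old names, a family called 'splits' lands in the same directory as the split working folder; with the dot-prefixed names the two sets are disjoint.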
[jira] [Updated] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Ted Yu updated HBASE-6067:
---
Fix Version/s: 0.92.2 (was: 0.92.3)

Integrated to 0.92 branch.

HBase won't start when hbase.rootdir uses ViewFileSystem
---
Key: HBASE-6067
URL: https://issues.apache.org/jira/browse/HBASE-6067
Project: HBase
Issue Type: Improvement
Components: regionserver
Reporter: Eli Collins
Assignee: Zhihong Ted Yu
Fix For: 0.92.2, 0.96.0, 0.94.1
Attachments: 6067-addendum.txt, 6067-v2.txt, 6067.txt

HBase currently doesn't work with HDFS federation (hbase.rootdir with a client that uses viewfs) because HLog#init uses FileSystem#getDefaultBlockSize and getDefaultReplication. These throw an exception because there is no default filesystem in a viewfs client so there's no way to determine a default block size or replication factor. They could use the versions of these methods that take a path, however these were introduced in HADOOP-8014 and are not yet available in Hadoop 1.x.
[jira] [Updated] (HBASE-6067) HBase won't start when hbase.rootdir uses ViewFileSystem
[ https://issues.apache.org/jira/browse/HBASE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Ted Yu updated HBASE-6067:
---
Resolution: Fixed
Status: Resolved (was: Patch Available)

HBase won't start when hbase.rootdir uses ViewFileSystem
---
Key: HBASE-6067
URL: https://issues.apache.org/jira/browse/HBASE-6067
Project: HBase
Issue Type: Improvement
Components: regionserver
Reporter: Eli Collins
Assignee: Zhihong Ted Yu
Fix For: 0.92.2, 0.96.0, 0.94.1
Attachments: 6067-addendum.txt, 6067-v2.txt, 6067.txt
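One generic way to cope with a method overload that exists only in newer library versions is to look it up reflectively and fall back to the older form. The sketch below shows that pattern only; it is not taken from the attached patches. `OldFs` and `NewFs` are hypothetical stand-ins rather than Hadoop classes, and the path is a plain `String` to keep the example self-contained.

```java
import java.lang.reflect.Method;

/** Illustrative only: prefer a path-taking overload when it exists, else use the no-arg one. */
final class BlockSizeLookup {
    /** Hypothetical stand-in for an older API that has only the no-arg method. */
    static class OldFs {
        public long getDefaultBlockSize() { return 64L * 1024 * 1024; }
    }

    /** Hypothetical stand-in for a newer API that also accepts a path. */
    static class NewFs extends OldFs {
        public long getDefaultBlockSize(String path) { return 128L * 1024 * 1024; }
    }

    static long defaultBlockSize(Object fs, String path) {
        try {
            try {
                // Newer API: ask for the block size that applies to this path.
                Method byPath = fs.getClass().getMethod("getDefaultBlockSize", String.class);
                return (Long) byPath.invoke(fs, path);
            } catch (NoSuchMethodException e) {
                // Older API: only the no-arg variant is available.
                Method noArg = fs.getClass().getMethod("getDefaultBlockSize");
                return (Long) noArg.invoke(fs);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot determine default block size", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultBlockSize(new OldFs(), "/hbase")); // 67108864
        System.out.println(defaultBlockSize(new NewFs(), "/hbase")); // 134217728
    }
}
```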
[jira] [Commented] (HBASE-6160) META entries from daughters can be deleted before parent entries
[ https://issues.apache.org/jira/browse/HBASE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288892#comment-13288892 ]

Zhihong Ted Yu commented on HBASE-6160:
---
@Enis: Can you attach log snippets to show the problem? e.g. what was the duration between the two splits.

META entries from daughters can be deleted before parent entries
---
Key: HBASE-6160
URL: https://issues.apache.org/jira/browse/HBASE-6160
Project: HBase
Issue Type: Bug
Components: client, regionserver
Affects Versions: 0.92.2, 0.94.0, 0.96.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

HBASE-5986 fixed an issue where the client sees the META entry for the parent, but not the children. However, after the fix, we have seen the following issue in tests:
Region A is split to -> B, C
Region B is split to -> D, E
After some time, the META entry for B is deleted since it is not needed anymore, but the META entry for Region A stays in META (C still refers to it). In this case, the client throws RegionOfflineException for B.
[jira] [Comment Edited] (HBASE-6160) META entries from daughters can be deleted before parent entries
[ https://issues.apache.org/jira/browse/HBASE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288892#comment-13288892 ]

Zhihong Ted Yu edited comment on HBASE-6160 at 6/4/12 8:47 PM:
---
@Enis: Can you attach log snippets to show the problem ? e.g. what was the interval between the two splits.

was (Author: zhi...@ebaysf.com):
@Enis: Can you attach log snippets to show the problem ? e.g. what was the duration between the two splits.

META entries from daughters can be deleted before parent entries
---
Key: HBASE-6160
URL: https://issues.apache.org/jira/browse/HBASE-6160
Project: HBase
Issue Type: Bug
Components: client, regionserver
Affects Versions: 0.92.2, 0.94.0, 0.96.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
[jira] [Comment Edited] (HBASE-3271) Allow .META. table to be exported
[ https://issues.apache.org/jira/browse/HBASE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935184#comment-12935184 ]

Zhihong Ted Yu edited comment on HBASE-3271 at 6/4/12 11:32 PM:
---
I used this code:
{code}
if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
  HRegionLocation regLoc = table.getRegionLocation(HConstants.EMPTY_BYTE_ARRAY);
  if (null == regLoc)
    throw new IOException("Expecting at least one region.");
  List<InputSplit> splits = new ArrayList<InputSplit>(1);
  InputSplit split = new TableSplit(table.getTableName(),
      HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY,
      regLoc.getServerAddress().getHostname());
  splits.add(split);
  return splits;
}
{code}
The following command only exports rows in .META. which have 'packageindex' (refer to HBASE-3255):
bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0 packageindex
-rwxrwxrwx 1 hadoop users 90700 Nov 24 03:31 h-meta/part-m-0

Allow .META. table to be exported
---
Key: HBASE-3271
URL: https://issues.apache.org/jira/browse/HBASE-3271
Project: HBase
Issue Type: Improvement
Components: util
Affects Versions: 0.20.6
Reporter: Ted Yu

I tried to export .META. table in 0.20.6 and got:
[hadoop@us01-ciqps1-name01 hbase]$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0
10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2010-11-23 20:59:05.255::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2010-11-23 20:59:05.255::INFO: verisons=1, starttime=0, endtime=9223372036854775807
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:host.name=us01-ciqps1-name01.carrieriq.com
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_21
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
...
10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful
10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.202.50.112:60020
10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.202.50.112:60020
10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020
Exception in thread "main" java.io.IOException: Expecting at least one region.
  at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281)
  at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
  at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146)

Related code is:
if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
  throw new IOException("Expecting at least one region.");
}

My intention was to save the dangling rows in .META. (for future investigation) which prevented a table from being created.
[jira] [Commented] (HBASE-6160) META entries from daughters can be deleted before parent entries
[ https://issues.apache.org/jira/browse/HBASE-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289131#comment-13289131 ]

Zhihong Ted Yu commented on HBASE-6160:
---
Minor comments:
{code}
+ * Scans META ane returns a pair of number of scanned rows, and
{code}
'ane' -> 'and'
I think 'a pair of' is not needed above.
{code}
+//we could not clean the parent, so it's daughters should not be cleaned as well (HBASE-6160)
{code}
'as well' -> 'either'

META entries from daughters can be deleted before parent entries
---
Key: HBASE-6160
URL: https://issues.apache.org/jira/browse/HBASE-6160
Project: HBase
Issue Type: Bug
Components: client, regionserver
Affects Versions: 0.92.2, 0.94.0, 0.96.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Attachments: HBASE-6160_v1.patch
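The rule quoted from the patch comment (if the parent could not be cleaned, its daughters should not be cleaned either) can be sketched as a small ordering check. This is a standalone illustration, not the CatalogJanitor code: `SplitCleanupOrder`, `cleanable` and `exampleLineage` are hypothetical names, regions are plain strings, and the candidate list is assumed to be ordered with parents before their daughters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: never clean a split region whose own parent is still uncleaned. */
final class SplitCleanupOrder {
    /** Lineage from the issue: A splits into B and C, then B splits into D and E. */
    static Map<String, String> exampleLineage() {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("B", "A");
        parentOf.put("C", "A");
        parentOf.put("D", "B");
        parentOf.put("E", "B");
        return parentOf;
    }

    /**
     * @param parentOf   maps each daughter region to its split parent
     * @param candidates split parents eligible for cleanup, parents listed before daughters
     * @param blocked    regions that cannot be cleaned yet (for example, still referenced)
     * @return the regions that are safe to clean in this pass
     */
    static List<String> cleanable(Map<String, String> parentOf,
                                  List<String> candidates,
                                  Set<String> blocked) {
        Set<String> notCleaned = new HashSet<>(blocked);
        List<String> result = new ArrayList<>();
        for (String region : candidates) {
            String parent = parentOf.get(region);
            boolean parentPending = parent != null && notCleaned.contains(parent);
            if (notCleaned.contains(region) || parentPending) {
                notCleaned.add(region); // propagate the block down the lineage
                continue;
            }
            result.add(region);
        }
        return result;
    }
}
```

In the scenario from the issue, A is blocked because C still refers to it, so `cleanable(exampleLineage(), [A, B], {A})` returns an empty list and the META entry for B survives until A can be removed.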
[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
[ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289455#comment-13289455 ]

Zhihong Ted Yu commented on HBASE-6147:
---
Nice start.
{code}
+ Thread.sleep(100);
+ waitedTimeForMasterInitialized += 100;
{code}
We don't know how long the sleep() call may actually have taken. Better to measure the elapsed time ourselves.
{code}
+ Thread.currentThread().interrupt();
+ throw new IOException("Interrupted", e);
{code}
An InterruptedIOException should be created above.

SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
---
Key: HBASE-6147
URL: https://issues.apache.org/jira/browse/HBASE-6147
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.1, 0.94.0
Reporter: ramkrishna.s.vasudevan
Fix For: 0.92.3
Attachments: HBASE-6147.patch

We are facing a few issues when a master restart and SSH run in parallel. Chunhui also suggested that we need to rework this part. This JIRA aims to address all such cases of region assignment inconsistency.
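Both review points on HBASE-6147 can be shown in one small helper. This is a sketch under assumptions, not the patch itself: `InitWait` and `waitFor` are hypothetical names and the readiness check is passed in as a `BooleanSupplier`. Elapsed time is taken from `System.nanoTime()` instead of summing the requested sleep durations, and an interrupt is reported as an `InterruptedIOException` with the interrupt flag restored.

```java
import java.io.InterruptedIOException;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Illustrative only: poll a condition against a real deadline. */
final class InitWait {
    /**
     * @return true if the condition became true before the timeout, false otherwise
     * @throws InterruptedIOException if the waiting thread is interrupted
     */
    static boolean waitFor(BooleanSupplier ready, long timeoutMs, long pollMs)
            throws InterruptedIOException {
        // Measure against the clock; sleep() may return later than requested.
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!ready.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt status visible
                InterruptedIOException iioe = new InterruptedIOException("Interrupted while waiting");
                iioe.initCause(e);
                throw iioe;
            }
        }
        return true;
    }
}
```

Since `InterruptedIOException` extends `IOException`, callers that already declare `IOException` need no signature change, while callers that care can still tell an interrupt apart from other I/O failures.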
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289503#comment-13289503 ]

Zhihong Ted Yu commented on HBASE-6060:
---
{code}
+ if (newPlan) {
+   this.regionPlans.remove(randomPlan.getRegionName());
+   LOG
+       .info("Server shutdown handler already in progress for the region "
+           + randomPlan.getRegionName());
+   randomPlan = RegionPlan.REGION_PLAN_ALREADY_INUSE;
{code}
It would be confusing to label a new plan 'ALREADY_INUSE'.
{code}
+ // the following singleton signifies that the plan is not usable
+ static final RegionPlan REGION_PLAN_ALREADY_INUSE = new RegionPlan(null, null, null);
{code}
I think UNUSABLE_REGION_PLAN would be a better name.

Regions's in OPENING state from failed regionservers takes a long time to recover
---
Key: HBASE-6060
URL: https://issues.apache.org/jira/browse/HBASE-6060
Project: HBase
Issue Type: Bug
Components: master, regionserver
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 0.96.0, 0.94.1, 0.92.3
Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 6060-trunk_3.patch, HBASE-6060-92.patch, HBASE-6060-94.patch

We have seen a pattern in tests where regions are stuck in OPENING state for a very long time when the region server that is opening the region fails. My understanding of the process:
- The master calls the rs to open the region. If the rs is offline, a new plan is generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), HMaster.assign()
- The RegionServer starts opening a region and changes the state in the znode. But that znode is not ephemeral. (see ZkAssign)
- The rs transitions the zk node from OFFLINE to OPENING. See OpenRegionHandler.process()
- The rs then opens the region, and changes the znode from OPENING to OPENED.
- When the rs is killed between the OPENING and OPENED states, zk shows OPENING state, and the master just waits for the rs to change the region state, but since the rs is down, that won't happen.
- There is an AssignmentManager.TimeoutMonitor, which guards against exactly these kinds of conditions. It periodically checks (every 10 sec by default) the regions in transition to see whether they have timed out (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is 30 min, which explains what you and I are seeing.
- ServerShutdownHandler in Master does not reassign regions in OPENING state, although it handles other states.
Lowering that threshold from the configuration is one option, but I still think we can do better. Will investigate more.
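To make the timing in the description concrete, here is a minimal sketch of a periodic timeout check. It is not the AssignmentManager.TimeoutMonitor implementation: `TransitionTimeoutCheck` and both method names are hypothetical, and regions in transition are modelled as a map from region name to the time of their last state change. The 10 second period and 30 minute timeout are the defaults quoted in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: find regions that have sat in transition longer than a timeout. */
final class TransitionTimeoutCheck {
    /** Returns the regions whose last state change is older than timeoutMs. */
    static List<String> timedOut(Map<String, Long> lastUpdateMs, long nowMs, long timeoutMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastUpdateMs.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** A check that runs every periodMs can notice an expiry up to one period late. */
    static long worstCaseDetectionMs(long timeoutMs, long periodMs) {
        return timeoutMs + periodMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 30L * 60 * 1000; // 30 min default quoted in the issue
        long periodMs = 10L * 1000;       // 10 sec default quoted in the issue
        System.out.println(worstCaseDetectionMs(timeoutMs, periodMs)); // 1810000
    }
}
```

A region stuck in OPENING on a dead server is therefore only picked up after roughly the full timeout, which is why relying on this check alone gives the long recovery described above.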
[jira] [Updated] (HBASE-6162) Move KeyValue to hbase-common module
[ https://issues.apache.org/jira/browse/HBASE-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Ted Yu updated HBASE-6162:
---
Fix Version/s: 0.96.0
Status: Patch Available (was: Open)

Move KeyValue to hbase-common module
---
Key: HBASE-6162
URL: https://issues.apache.org/jira/browse/HBASE-6162
Project: HBase
Issue Type: Improvement
Affects Versions: 0.96.0
Reporter: Matt Corgan
Assignee: Matt Corgan
Fix For: 0.96.0
Attachments: HBASE-6162-v1.patch

* pull KeyValue up to hbase-common module

This is part of the modularization strategy in HBASE-5977, and is specifically necessary to modularize HBASE-4676.

also brings these classes to hbase-common:
* ClassSize, HeapSize
* HTestConst
* TestKeyValue, KeyValueTestUtil
* LoadTestKVGenerator, TestLoadTestKVGenerator
* MD5Hash

moves a trivial constant (HRegionInfo.DELIMITER) from HRegionInfo to HConstants
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289703#comment-13289703 ]

Zhihong Ted Yu commented on HBASE-5924:
---
Nice work.
{code}
private static class Process<R> {
{code}
Can we name the above class more meaningfully? javadoc is desirable.
{code}
* @param sleepTime - sleep time befora actually executing the actions. Can be zero.
{code}
'befora' -> 'before'
{code}
for (Action<R> aTodo : actionsList) {
{code}
aTodo -> anAction ?
{code}
Callable<MultiResponse> callable = createDelayedCallable(sleepTime, e.getKey(), e.getValue());
Triple<MultiAction<R>, HRegionLocation, Future<MultiResponse>> p = new Triple<MultiAction<R>, HRegionLocation, Future<MultiResponse>>(e.getValue(), e.getKey(), this.pool.submit(callable));
{code}
Wrap the two long lines above.
{code}
throw new IllegalArgumentException(
    "argument results must be the same size as argument list");
{code}
It would be nice to include the sizes in the exception message.

In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
---
Key: HBASE-5924
URL: https://issues.apache.org/jira/browse/HBASE-5924
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Attachments: 5924.v5.patch

The client (in the function HConnectionManager#processBatchCallback) works in two steps:
- make the requests
- collect the failures and successes and prepare for retry
It means that when there is an immediate error (region moved, split, dead server, ...) we still wait for all the initial requests to be executed before resubmitting the failed request. If we have a scenario with all the requests taking 5 seconds, we have a final execution time of: 5 (initial requests) + 1 (wait time) + 5 (final request) = 11s.
We could improve this by analyzing the results immediately. This would lead us, for the scenario mentioned above, to 6 seconds.
So we could have a performance improvement of nearly 50% in many cases, and much more than 50% if the request execution times differ.
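The arithmetic in the description can be written out as two small formulas. This is only a model of the timings, not client code: `RetryTiming` and its method names are hypothetical, all durations are in seconds, and the failed request is assumed to be retried exactly once.

```java
/** Illustrative only: total batch time under the two retry strategies described above. */
final class RetryTiming {
    /** Current behaviour: wait for every initial request, then pause, then retry. */
    static long waitForAllThenRetry(long slowestInitial, long pause, long retry) {
        return slowestInitial + pause + retry;
    }

    /** Proposed behaviour: the retry path starts as soon as the failure is seen. */
    static long retryImmediately(long slowestInitial, long failureSeenAt, long pause, long retry) {
        // The batch ends when both the initial requests and the retry path are done.
        return Math.max(slowestInitial, failureSeenAt + pause + retry);
    }

    public static void main(String[] args) {
        System.out.println(waitForAllThenRetry(5, 1, 5)); // 11
        System.out.println(retryImmediately(5, 0, 1, 5)); // 6
    }
}
```

For the scenario in the issue (5 s requests, 1 s pause, failure seen at once) the totals are 11 s and 6 s, a saving of 5/11, or about 45%, which matches the "nearly 50%" estimate.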
[jira] [Updated] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassign the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5816: -- Description: The first assign thread exits with success after updating the RegionState to PENDING_OPEN, while the second assign follows immediately into assign and fails the RegionState check in setOfflineInZooKeeper(). This causes the master to abort. In the below case, the two concurrent assigns occurred when AM tried to assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler tried to assign this region (from the region plan) spontaneously. {code} 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. (offlining) 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 2012-04-17 05:44:57,666 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING) 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
state=CLOSED, ts=1334612697672, server=hadoop05.sh.intel.com,60020,1334544902186 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x236b912e9b3000e Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:54:19,159 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN, ts=1334613179096, server=xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:54:59,033 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. 
ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 remote=/10.239.47.87:60020] at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283) at $Proxy7.openRegion(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892) at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302
[jira] [Commented] (HBASE-6164) Correct the bug in block encoding usage in bulkload
[ https://issues.apache.org/jira/browse/HBASE-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290189#comment-13290189 ]

Zhihong Ted Yu commented on HBASE-6164:
---
Patch looks good.
{code}
- // Save data block encoder metadata in the file info.
- dataBlockEncoder.saveMetadata(this);
{code}
Why is the above method lifted out of StoreFile ?

Correct the bug in block encoding usage in bulkload
---
Key: HBASE-6164
URL: https://issues.apache.org/jira/browse/HBASE-6164
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 0.96.0, 0.94.1
Reporter: Anoop Sam John
Assignee: Anoop Sam John
Fix For: 0.96.0, 0.94.1
Attachments: 6164_94.patch, 6164_Trunk.patch

Address the issue raised under HBASE-6040
https://issues.apache.org/jira/browse/HBASE-6040?focusedCommentId=13289334page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13289334
[jira] [Commented] (HBASE-6012) Handling RegionOpeningState for bulk assign since SSH using
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290254#comment-13290254 ]

Zhihong Ted Yu commented on HBASE-6012:
---
{code}
+ List<RegionOpeningState> regionOpeningStateList = this.serverManager
+     .sendRegionOpen(destination, regions);
+ for (int i = 0; i < regionOpeningStateList.size(); i++) {
{code}
Should we check whether the return from sendRegionOpen() is null?
In ServerShutdownHandler.java:
{code}
+ if(rit != null){
+   //clean zk node
+   try{
+     ZKAssign.deleteNodeFailSilent(services.getZooKeeper(), e.getKey());
{code}
A log statement that reveals the value of rit should be added.

Handling RegionOpeningState for bulk assign since SSH using
---
Key: HBASE-6012
URL: https://issues.apache.org/jira/browse/HBASE-6012
Project: HBase
Issue Type: Bug
Affects Versions: 0.96.0
Reporter: chunhui shen
Assignee: chunhui shen
Fix For: 0.96.0
Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch

Since HBASE-5914, we are using bulk assign for SSH.
But in the bulk assign case, if we get an ALREADY_OPENED result, there is no one to clear the znode created by bulk assign.
Another thing: when an RS is opening a list of regions and one region is already in transition, it will throw RegionAlreadyInTransitionException and stop opening the other regions.
[jira] [Commented] (HBASE-5360) [uberhbck] Add options for how to handle offline split parents.
[ https://issues.apache.org/jira/browse/HBASE-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290300#comment-13290300 ] Zhihong Ted Yu commented on HBASE-5360: --- {code} - Path referencePath = getReferredToFile(p); + Path referencePath = StoreFileUtil.getReferredToFile(p); {code} Can you outline the rationale behind moving methods to StoreFileUtil ? For sidelineSplitParent(): {code} +if (needReassign) { + toBeReassigned.add(child); +} {code} The above code is inside inner loop. Would a child be added to toBeReassigned multiple times ? {code} + // scenario (2) + errors.reportError(ERROR_CODE.FAILED_SPLIT_PARENT, Region ++ descriptiveName + , key= + key + , on HDFS, failed split parent); + if (shouldFixSplitParents()) { +resetSplitParent(hbi); {code} Do we need to do something about the children in above scenario ? I think we should provide two flags to user corresponding to the two scenarios so that they can decide which scenario(s) to fix. {code} + * Daughters do refer to parent. + */ + @Test + public void testLingeringSplitParent2() throws Exception { {code} Please give the two test cases meaningful names that are consistent with javadoc. [uberhbck] Add options for how to handle offline split parents. Key: HBASE-5360 URL: https://issues.apache.org/jira/browse/HBASE-5360 Project: HBase Issue Type: Improvement Components: hbck Affects Versions: 0.90.7, 0.92.1, 0.94.0 Reporter: Jonathan Hsieh Assignee: Jimmy Xiang Attachments: hbase-5360.path In a recent case, we attempted to repair a cluster that suffered from HBASE-4238 that had about 6-7 generations of leftover split data. The hbck repair options in an development version of HBASE-5128 treat HDFS as ground truth but didn't check SPLIT and OFFLINE flags only found in meta. The net effect was that it essentially attempted to merge many regions back into its eldest geneneration's parent's range. More safe guards to prevent mega-merges are being added on HBASE-5128. 
This issue would automate the handling of the mega-merge avoiding cases such as lingering grandparents. The strategy here would be to add more checks against .META., and perform part of the catalog janitor's responsibilities for lingering grandparents. This would potentially include options for sidelining regions, deleting grandparent regions, a minimum size for sidelining, and mechanisms for cleaning .META.. Note: There already exists a mechanism to reload these regions -- the bulk load mechanism in LoadIncrementalHFiles can be used to re-add grandparents (automatically splitting them if necessary) to HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
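On the review question about a child being added to toBeReassigned more than once from the inner loop: one way to make that harmless is to collect into a set. The following is only a minimal sketch under assumptions. The class name, the String region names, and the map of per-check flags are hypothetical stand-ins, since the full patch is not quoted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the sidelineSplitParent() loop; not hbck's real types.
final class ReassignDedupSketch {
  /**
   * For each child, the inner loop may see needReassign == true several times.
   * A LinkedHashSet keeps each child once and preserves first-seen order.
   */
  static List<String> collect(Map<String, List<Boolean>> needReassignPerCheck) {
    Set<String> toBeReassigned = new LinkedHashSet<>();
    for (Map.Entry<String, List<Boolean>> child : needReassignPerCheck.entrySet()) {
      for (boolean needReassign : child.getValue()) { // the inner loop from the review comment
        if (needReassign) {
          toBeReassigned.add(child.getKey()); // no-op if the child is already present
        }
      }
    }
    return new ArrayList<>(toBeReassigned);
  }
}
```

If toBeReassigned has to remain a List in the patch, a contains() check before add() gives the same effect at a higher cost per insertion.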
[jira] [Commented] (HBASE-6162) Move KeyValue to hbase-common module
[ https://issues.apache.org/jira/browse/HBASE-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290305#comment-13290305 ] Zhihong Ted Yu commented on HBASE-6162: --- Hadoop QA didn't run any test. See https://builds.apache.org/job/PreCommit-HBASE-Build/2108/console. Move KeyValue to hbase-common module Key: HBASE-6162 URL: https://issues.apache.org/jira/browse/HBASE-6162 Project: HBase Issue Type: Improvement Affects Versions: 0.96.0 Reporter: Matt Corgan Assignee: Matt Corgan Fix For: 0.96.0 Attachments: HBASE-6162-v1.patch * pull KeyValue up to hbase-common module This is part of the modularization strategy in HBASE-5977, and is specifically necessary to modularize HBASE-4676. also brings these classes to hbase-common: * ClassSize, HeapSize * HTestConst * TestKeyValue, KeyValueTestUtil * LoadTestKVGenerator, TestLoadTestKVGenerator * MD5Hash moves a trivial constant (HRegionInfo.DELIMITER) from HRegionInfo to HConstants
[jira] [Commented] (HBASE-6153) RS aborted due to rename problem (maybe a race)
[ https://issues.apache.org/jira/browse/HBASE-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290367#comment-13290367 ] Zhihong Ted Yu commented on HBASE-6153: --- ip-10-68-7-146.ec2.internal went down: {code} 2012-05-31 18:34:42,541 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ip-10-68-7-146.ec2.internal,60020,1338343120038: Replay of HLog required. Forcing server shutdown {code} The above lagged the other log snippets by 3 hours. More log around 05-31 15:11 from ip-10-68-7-146.ec2.internal should help clarify. RS aborted due to rename problem (maybe a race) --- Key: HBASE-6153 URL: https://issues.apache.org/jira/browse/HBASE-6153 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Devaraj Das Assignee: Devaraj Das I had a RS crash with the following: 2012-05-31 18:34:42,534 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/.tmp/294a7a31f04949b8bf07682a43157b35 to hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35 2012-05-31 18:34:42,536 WARN org.apache.hadoop.hbase.regionserver.Store: Unable to rename hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/.tmp/294a7a31f04949b8bf07682a43157b35 to hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35 2012-05-31 18:34:42,541 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ip-10-68-7-146.ec2.internal,60020,1338343120038: Replay of HLog required. 
Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/87_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d. at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1288) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1172) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1114) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:400) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:374) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:243) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.FileNotFoundException: File does not exist: /apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1901) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.init(DFSClient.java:1892) at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:636) at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:387) at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.init(StoreFile.java:1008) at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:470) at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548) at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:595) On the NameNode logs: 2012-05-31 18:34:42,588 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/.tmp/294a7a31f04949b8bf07682a43157b35 to 
/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35 because destination's parent does not exist I haven't looked deeply yet but I guess it is a race of some sort.
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290483#comment-13290483 ] Zhihong Ted Yu commented on HBASE-5924: --- {code} + // We need the origin multi action to find out what are the actions to replay if {code} 'origin' - 'original', 'what are the actions to replay' - 'what actions to replay' {code} +} catch (InterruptedException e) { + throw new IOException(e); {code} InterruptedIOException should be thrown. {code} + // mutate list so that it is empty for complete success, or contains + // only failed records results are returned in the same order as the + // requests in list walk the list backwards, so we can remove from list {code} The above is hard to read. A period between 'records' and 'results' ? A period between 'list' and 'walk' ? Hadoop QA didn't run tests: https://builds.apache.org/job/PreCommit-HBASE-Build/2116/console In the client code, don't wait for all the requests to be executed before resubmitting a request in error. -- Key: HBASE-5924 URL: https://issues.apache.org/jira/browse/HBASE-5924 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Attachments: 5924.v5.patch, 5924.v9.patch The client (in the function HConnectionManager#processBatchCallback) works in two steps: - make the requests - collect the failures and successes and prepare for retry It means that when there is an immediate error (region moved, split, dead server, ...) we still wait for all the initial requests to be executed before submitting again the failed request. If we have a scenario with all the requests taking 5 seconds we have a final execution time of: 5 (initial requests) + 1 (wait time) + 5 (final request) = 11s. We could improve this by analyzing immediately the results. This would lead us, for the scenario mentioned above, to 6 seconds. 
So we could have a performance improvement of nearly 50% in many cases, and much more than 50% if the request execution time is different.
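The review comment on HBASE-5924 asks for InterruptedIOException rather than a plain IOException when an InterruptedException is caught. A minimal sketch of that conversion follows. The class and method names are hypothetical and this is not the patch's code; restoring the interrupt flag is an addition of this sketch, not something the comment requested.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical helper illustrating the exception conversion only.
final class InterruptSketch {
  /** Sleeps, and reports an interrupt as InterruptedIOException (a subclass of IOException). */
  static void sleepOrThrow(long millis) throws IOException {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      // Restore the flag so callers further up can still observe the interrupt.
      Thread.currentThread().interrupt();
      InterruptedIOException iioe = new InterruptedIOException("interrupted while waiting");
      iioe.initCause(e); // InterruptedIOException has no (String, Throwable) constructor
      throw iioe;
    }
  }
}
```

Callers that only catch IOException keep working, while callers that care about interruption can catch the more specific type.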
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290509#comment-13290509 ] Zhihong Ted Yu commented on HBASE-5924: --- hbase-server/src/main/java/org/apache/hadoop/hbase/util/Triple.java was not included in patch v9. Hence: {code} [ERROR] /home/hduser/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java:[2061,23] cannot find symbol [ERROR] symbol : class Triple [ERROR] location: class org.apache.hadoop.hbase.client.HConnectionManager.HConnectionImplementation.ProcessR {code} In the client code, don't wait for all the requests to be executed before resubmitting a request in error. -- Key: HBASE-5924 URL: https://issues.apache.org/jira/browse/HBASE-5924 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Attachments: 5924.v5.patch, 5924.v9.patch The client (in the function HConnectionManager#processBatchCallback) works in two steps: - make the requests - collect the failures and successes and prepare for retry It means that when there is an immediate error (region moved, split, dead server, ...) we still wait for all the initial requests to be executed before submitting again the failed request. If we have a scenario with all the requests taking 5 seconds we have a final execution time of: 5 (initial requests) + 1 (wait time) + 5 (final request) = 11s. We could improve this by analyzing immediately the results. This would lead us, for the scenario mentioned above, to 6 seconds. So we could have a performance improvement of nearly 50% in many cases, and much more than 50% if the request execution time is different. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
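The issue description proposes analyzing results as they arrive instead of waiting for every initial request before resubmitting a failed one. One way to picture that, using only java.util.concurrent, is sketched below. This is not HConnectionManager#processBatchCallback: the class name, the Outcome holder and the maxAttempts parameter are hypothetical, and the 1 second wait before a retry mentioned in the description is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: resubmit a failed action as soon as its failure is seen.
final class EagerRetrySketch {
  private static final class Outcome<T> {
    final int index;
    final T value;
    final Exception error;
    Outcome(int index, T value, Exception error) {
      this.index = index;
      this.value = value;
      this.error = error;
    }
  }

  // Wraps an action so that a failure comes back as data tagged with its index.
  private static <T> Callable<Outcome<T>> wrap(int index, Callable<T> action) {
    return () -> {
      try {
        return new Outcome<T>(index, action.call(), null);
      } catch (Exception e) {
        return new Outcome<T>(index, null, e);
      }
    };
  }

  /** Returns results in request order; an action that keeps failing leaves null in its slot. */
  static <T> List<T> runAll(List<Callable<T>> actions, int maxAttempts)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, actions.size()));
    try {
      CompletionService<Outcome<T>> done = new ExecutorCompletionService<>(pool);
      int[] attempts = new int[actions.size()];
      List<T> results = new ArrayList<>(Collections.<T>nCopies(actions.size(), null));
      int pending = 0;
      for (int i = 0; i < actions.size(); i++) {
        done.submit(wrap(i, actions.get(i)));
        attempts[i] = 1;
        pending++;
      }
      while (pending > 0) {
        Outcome<T> o = done.take().get(); // next finished action, in completion order
        pending--;
        if (o.error == null) {
          results.set(o.index, o.value);
        } else if (attempts[o.index] < maxAttempts) {
          attempts[o.index]++;
          done.submit(wrap(o.index, actions.get(o.index))); // retry without waiting for the rest
          pending++;
        }
      }
      return results;
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With the numbers from the description (5 s requests, one of which fails immediately), the retry starts right after the failure is observed rather than after the slowest initial request, which is where the 11 s versus 6 s comparison comes from.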
[jira] [Commented] (HBASE-6162) Move KeyValue to hbase-common module
[ https://issues.apache.org/jira/browse/HBASE-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290525#comment-13290525 ] Zhihong Ted Yu commented on HBASE-6162: --- The patch produces a lot of compilation errors: {code} [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestProcessBasedCluster.java:[29,30] cannot find symbol [ERROR] symbol : class HTestConst [ERROR] location: package org.apache.hadoop.hbase [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedWriter.java:[121,11] cannot find symbol [ERROR] symbol : variable LoadTestKVGenerator [ERROR] location: class org.apache.hadoop.hbase.util.MultiThreadedWriter [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedWriter.java:[128,58] cannot find symbol [ERROR] symbol : class LoadTestKVGenerator [ERROR] location: class org.apache.hadoop.hbase.util.MultiThreadedWriter.HBaseWriterThread [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedWriter.java:[191,24] cannot find symbol [ERROR] symbol : variable LoadTestKVGenerator [ERROR] location: class org.apache.hadoop.hbase.util.MultiThreadedWriter.HBaseWriterThread [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedReader.java:[226,10] cannot find symbol [ERROR] symbol : variable LoadTestKVGenerator [ERROR] location: class org.apache.hadoop.hbase.util.MultiThreadedReader.HBaseReaderThread [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedReader.java:[282,17] cannot find symbol [ERROR] symbol : variable LoadTestKVGenerator [ERROR] location: class org.apache.hadoop.hbase.util.MultiThreadedReader.HBaseReaderThread [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/filter/TestColumnRangeFilter.java:[192,24] 
cannot find symbol [ERROR] symbol : variable KeyValueTestUtil [ERROR] location: class org.apache.hadoop.hbase.filter.TestColumnRangeFilter [ERROR] [ERROR] /home/hduser/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreScanner.java:[80,8] cannot find symbol [ERROR] symbol : variable KeyValueTestUtil [ERROR] location: class org.apache.hadoop.hbase.regionserver.TestStoreScanner {code} Move KeyValue to hbase-common module Key: HBASE-6162 URL: https://issues.apache.org/jira/browse/HBASE-6162 Project: HBase Issue Type: Improvement Affects Versions: 0.96.0 Reporter: Matt Corgan Assignee: Matt Corgan Fix For: 0.96.0 Attachments: HBASE-6162-v1.patch * pull KeyValue up to hbase-common module This is part of the modularization strategy in HBASE-5977, and is specifically necessary to modularize HBASE-4676. also brings these classes to hbase-common: * ClassSize, HeapSize * HTestConst * TestKeyValue, KeyValueTestUtil * LoadTestKVGenerator, TestLoadTestKVGenerator * MD5Hash moves a trivial constant (HRegionInfo.DELIMITER) from HRegionInfo to HConstants
[jira] [Commented] (HBASE-6012) Handling RegionOpeningState for bulk assign since SSH using
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291122#comment-13291122 ] Zhihong Ted Yu commented on HBASE-6012: --- {code} LOG.info("Unable to communicate with the region server in order " + "to assign regions", e); - return false; + // Server may already get RPC + return true; {code} What was the reasoning behind the above change ? {code} - try { -if (!assign(e.getKey(), e.getValue())) { - failedPlans.put(e.getKey(), e.getValue()); -} - } catch (Throwable t) { + if (!assign(e.getKey(), e.getValue())) { {code} I think the catch clause should be kept. For HRegionServer.java, there are a lot of formatting changes which distract from reviewing. {code} + } catch (RegionAlreadyInTransitionException rie) { +LOG.warn("", rie); {code} Please add a message to the log call above. Handling RegionOpeningState for bulk assign since SSH using --- Key: HBASE-6012 URL: https://issues.apache.org/jira/browse/HBASE-6012 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch, HBASE-6012v4.patch, HBASE-6012v5.patch Since HBASE-5914, we have been using bulk assign for SSH. But in the bulk assign case, if we get an ALREADY_OPENED response, nothing clears the znode created by bulk assign. Also, when an RS is opening a list of regions and one region is already in transition, it throws RegionAlreadyInTransitionException and stops opening the other regions.
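To illustrate why the reviewer wants the catch clause kept: without it, an unexpected exception from one assign() call ends the loop and the remaining plans are never attempted. The sketch below is hypothetical. The generic types, the BiPredicate standing in for assign(), and the choice to record the plan as failed inside the catch are all assumptions, because the body of the original catch block is not quoted in the comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical stand-in for the bulk-assign loop; not AssignmentManager's real code.
final class BulkAssignSketch {
  /** Tries every plan and returns the ones that failed, whether by 'false' or by exception. */
  static <S, R> Map<S, R> assignAll(Map<S, R> plans, BiPredicate<S, R> assign) {
    Map<S, R> failedPlans = new LinkedHashMap<>();
    for (Map.Entry<S, R> e : plans.entrySet()) {
      try {
        if (!assign.test(e.getKey(), e.getValue())) {
          failedPlans.put(e.getKey(), e.getValue());
        }
      } catch (Throwable t) {
        // Assumed handling: one failing server must not abort the remaining plans.
        failedPlans.put(e.getKey(), e.getValue());
      }
    }
    return failedPlans;
  }
}
```

The failed plans can then be retried or reassigned individually, which is what the failedPlans map in the removed lines appears to be for.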
[jira] [Commented] (HBASE-6147) SSH and AM.joinCluster leads to region assignment inconsistency in many cases.
[ https://issues.apache.org/jira/browse/HBASE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291139#comment-13291139 ] Zhihong Ted Yu commented on HBASE-6147: --- In testing phase, an option may be introduced to enable the following: {code} + waitTillMasterInitialized(); {code} so that we can compare performance difference. SSH and AM.joinCluster leads to region assignment inconsistency in many cases. -- Key: HBASE-6147 URL: https://issues.apache.org/jira/browse/HBASE-6147 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: ramkrishna.s.vasudevan Fix For: 0.92.3 Attachments: HBASE-6147.patch, HBASE-6147_trunk.patch We are facing a few issues when the master restart and SSH run in parallel. Chunhui also suggested that we need to rework this part. This JIRA is aimed at solving all such possibilities of region assignment inconsistency.
[jira] [Commented] (HBASE-5498) Secure Bulk Load
[ https://issues.apache.org/jira/browse/HBASE-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291182#comment-13291182 ] Zhihong Ted Yu commented on HBASE-5498: --- {code} + * @param familyPaths list of family names to store files adding + * or removing from this list will add or remove HFiles to be bulk loaded. {code} Add a period between 'files' and 'adding'. Capitalize the 'a' of 'adding'. {code} +for(Pair<byte[], String> el: familyPaths) +families.add(el.getFirst()); {code} Add a space between 'for' and '(', and between 'el' and the colon. families.add() should be put on the same line as for. {code} -class StoreFileScanner implements KeyValueScanner { +public class StoreFileScanner implements KeyValueScanner { {code} I don't see StoreFileScanner accessed in AccessController. So the above change is not needed. {code} + //TODO make this configurable + //two levels so it doesn't get deleted accidentally + //no sticky bit in Hadoop 1.0 + private Path stagingDir = new Path("/tmp/hbase-staging"); {code} I think the path should be configurable. {code} + private User getActiveUser() throws IOException { +User user = RequestContext.getRequestUser(); +if (!RequestContext.isInRequestContext()) { {code} if statement can be lifted above assignment. {code} +public interface SecureBulkLoadProtocol extends CoprocessorProtocol { {code} Add javadoc for the protocol. Secure Bulk Load Key: HBASE-5498 URL: https://issues.apache.org/jira/browse/HBASE-5498 Project: HBase Issue Type: Improvement Components: mapred, security Reporter: Francis Liu Attachments: HBASE-5498_draft.patch Design doc: https://cwiki.apache.org/confluence/display/HCATALOG/HBase+Secure+Bulk+Load Short summary: Security as it stands does not cover the bulkLoadHFiles() feature. Users calling this method will bypass ACLs. Also loading is made more cumbersome in a secure setting because of hdfs privileges. 
bulkLoadHFiles() moves the data from the user's directory to the hbase directory, which requires certain write access privileges to be set. Our solution is to create a coprocessor which makes use of AuthManager to verify whether a user has write access to the table. If so, it launches an MR job as the hbase user to do the importing (i.e., rewrite from text to hfiles). One tricky thing this job will have to do is impersonate the calling user when reading the input files. We can do this by expecting the user to pass an hdfs delegation token as part of the secureBulkLoad() coprocessor call and extending an inputformat to make use of that token. The output is written to a temporary directory accessible only by hbase and then bulkloadHFiles() is called.
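The review comment above suggests making the staging path configurable instead of hard-coding /tmp/hbase-staging. A minimal sketch of that idea follows, using java.util.Properties as a stand-in for the Hadoop Configuration object so that the example stays self-contained. The class name and the property key are assumptions made for illustration; only the default path comes from the quoted patch.

```java
import java.util.Properties;

// Hypothetical sketch: look up the staging directory, falling back to the patch's default.
final class StagingDirSketch {
  // Assumed key name, not taken from the patch.
  static final String STAGING_DIR_KEY = "hbase.bulkload.staging.dir";
  // Default taken from the hard-coded value in the quoted patch.
  static final String DEFAULT_STAGING_DIR = "/tmp/hbase-staging";

  static String stagingDir(Properties conf) {
    return conf.getProperty(STAGING_DIR_KEY, DEFAULT_STAGING_DIR);
  }
}
```

In the actual coprocessor the lookup would presumably go through the region server's Configuration, with the result wrapped in a Path.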
[jira] [Assigned] (HBASE-5498) Secure Bulk Load
[ https://issues.apache.org/jira/browse/HBASE-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu reassigned HBASE-5498: - Assignee: Francis Liu Secure Bulk Load Key: HBASE-5498 URL: https://issues.apache.org/jira/browse/HBASE-5498 Project: HBase Issue Type: Improvement Components: mapred, security Reporter: Francis Liu Assignee: Francis Liu Attachments: HBASE-5498_draft.patch Design doc: https://cwiki.apache.org/confluence/display/HCATALOG/HBase+Secure+Bulk+Load Short summary: Security as it stands does not cover the bulkLoadHFiles() feature. Users calling this method will bypass ACLs. Also loading is made more cumbersome in a secure setting because of hdfs privileges. bulkLoadHFiles() moves the data from the user's directory to the hbase directory, which requires certain write access privileges to be set. Our solution is to create a coprocessor which makes use of AuthManager to verify whether a user has write access to the table. If so, it launches an MR job as the hbase user to do the importing (i.e., rewrite from text to hfiles). One tricky thing this job will have to do is impersonate the calling user when reading the input files. We can do this by expecting the user to pass an hdfs delegation token as part of the secureBulkLoad() coprocessor call and extending an inputformat to make use of that token. The output is written to a temporary directory accessible only by hbase and then bulkloadHFiles() is called.
[jira] [Updated] (HBASE-5498) Secure Bulk Load
[ https://issues.apache.org/jira/browse/HBASE-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5498: -- Fix Version/s: 0.96.0 Secure Bulk Load Key: HBASE-5498 URL: https://issues.apache.org/jira/browse/HBASE-5498 Project: HBase Issue Type: Improvement Components: mapred, security Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 Attachments: HBASE-5498_draft.patch Design doc: https://cwiki.apache.org/confluence/display/HCATALOG/HBase+Secure+Bulk+Load Short summary: Security as it stands does not cover the bulkLoadHFiles() feature. Users calling this method will bypass ACLs. Also loading is made more cumbersome in a secure setting because of hdfs privileges. bulkLoadHFiles() moves the data from the user's directory to the hbase directory, which requires certain write access privileges to be set. Our solution is to create a coprocessor which makes use of AuthManager to verify whether a user has write access to the table. If so, it launches an MR job as the hbase user to do the importing (i.e., rewrite from text to hfiles). One tricky thing this job will have to do is impersonate the calling user when reading the input files. We can do this by expecting the user to pass an hdfs delegation token as part of the secureBulkLoad() coprocessor call and extending an inputformat to make use of that token. The output is written to a temporary directory accessible only by hbase and then bulkloadHFiles() is called.
[jira] [Commented] (HBASE-5498) Secure Bulk Load
[ https://issues.apache.org/jira/browse/HBASE-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291273#comment-13291273 ] Zhihong Ted Yu commented on HBASE-5498: --- What happens if a user continues using LoadIncrementalHFiles directly ? Secure Bulk Load Key: HBASE-5498 URL: https://issues.apache.org/jira/browse/HBASE-5498 Project: HBase Issue Type: Improvement Components: mapred, security Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 Attachments: HBASE-5498_draft.patch Design doc: https://cwiki.apache.org/confluence/display/HCATALOG/HBase+Secure+Bulk+Load Short summary: Security as it stands does not cover the bulkLoadHFiles() feature. Users calling this method will bypass ACLs. Also loading is made more cumbersome in a secure setting because of hdfs privileges. bulkLoadHFiles() moves the data from the user's directory to the hbase directory, which requires certain write access privileges to be set. Our solution is to create a coprocessor which makes use of AuthManager to verify whether a user has write access to the table. If so, it launches an MR job as the hbase user to do the importing (i.e., rewrite from text to hfiles). One tricky thing this job will have to do is impersonate the calling user when reading the input files. We can do this by expecting the user to pass an hdfs delegation token as part of the secureBulkLoad() coprocessor call and extending an inputformat to make use of that token. The output is written to a temporary directory accessible only by hbase and then bulkloadHFiles() is called.
[jira] [Commented] (HBASE-5533) Add more metrics to HBase
[ https://issues.apache.org/jira/browse/HBASE-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291318#comment-13291318 ] Zhihong Ted Yu commented on HBASE-5533: --- I saw a lot of the following in test output (https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/3000/testReport/org.apache.hadoop.hbase.regionserver/TestSplitTransactionOnCluster/testShutdownFixupWhenDaughterHasSplit/): {code} 2012-06-07 18:33:14,623 ERROR [RegionServer:0;juno.apache.org,39424,1339093992166] util.MetricsDynamicMBeanBase(116): unknown metrics type: org.apache.hadoop.hbase.metrics.histogram.MetricsHistogram {code} Add more metrics to HBase - Key: HBASE-5533 URL: https://issues.apache.org/jira/browse/HBASE-5533 Project: HBase Issue Type: Improvement Affects Versions: 0.92.2, 0.94.0 Reporter: Shaneal Manek Assignee: Shaneal Manek Priority: Minor Fix For: 0.92.2, 0.94.0, 0.96.0 Attachments: BlockingQueueContention.java, HBASE-5533-0.92-v4.patch, HBASE-5533-TRUNK-v6.patch, HBASE-5533-TRUNK-v6.patch, HBASE-5533-v7-0.92.patch, TimingOverhead.java, hbase-5533-0.92.patch, hbase5533-0.92-v2.patch, hbase5533-0.92-v3.patch, hbase5533-0.92-v5.patch, histogram_web_ui.png To debug/monitor production clusters, there are some more metrics I wish I had available. In particular: - Although the average FS latencies are useful, a 'histogram' of recent latencies (90% of reads completed in under 100ms, 99% in under 200ms, etc) would be more useful - Similar histograms of latencies on common operations (GET, PUT, DELETE) would be useful - Counting the number of accesses to each region to detect hotspotting - Exposing the current number of HLog files
[jira] [Commented] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291353#comment-13291353 ] Zhihong Ted Yu commented on HBASE-5726: --- From https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/3000/testReport/org.apache.hadoop.hbase.regionserver/TestSplitTransactionOnCluster/testShutdownFixupWhenDaughterHasSplit/: {code} 2012-06-07 18:33:22,794 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZKUtil(1142): master:49315-0x137c838bfa6 Retrieved 103 byte(s) of data from znode /hbase/unassigned/73830568ee93434ba97f7b5ade48ae30 and set watcher; region=ephemeral,,1339093997065.73830568ee93434ba97f7b5ade48ae30., state=RS_ZK_REGION_SPLITTING, servername=juno.apache.org,39424,1339093992166, createTime=1339094002792, payload.length=0 ... 2012-06-07 18:33:47,887 DEBUG [Thread-941] regionserver.TestSplitTransactionOnCluster(482): Waiting on region to split 2012-06-07 18:33:47,922 DEBUG [RegionServer:8;juno.apache.org,43570,1339094025325-splits-1339094027483] regionserver.HRegion(463): Instantiated testMasterRestartAtRegionSplitPendingCatalogJanitor,,1339094027484.23694c0a5312f5801dfd5a2857cc3556. 2012-06-07 18:33:23,648 DEBUG [RegionServer:0;juno.apache.org,39424,1339093992166-splits-1339094002786] regionserver.HRegion(463): Instantiated ephemeral,mnk,1339094002786.b5c2d9c3e0939c583f874e3efd51b478. 2012-06-07 18:33:23,680 INFO [RegionServer:0;juno.apache.org,39424,1339093992166-splits-1339094002786] catalog.MetaEditor(191): Offlined parent region ephemeral,,1339093997065.73830568ee93434ba97f7b5ade48ae30. in META {code} We can see that region 73830568ee93434ba97f7b5ade48ae30 didn't finish splitting after the last 'Waiting on region to split' was printed. 
In split() method: {code} while (ProtobufUtil.getOnlineRegions(server).size() <= regionCount) { LOG.debug("Waiting on region to split"); {code} I think the above method should be improved: if a region is moved onto the server, the loop would exit even though the number of daughter regions isn't 2. TestSplitTransactionOnCluster occasionally failing -- Key: HBASE-5726 URL: https://issues.apache.org/jira/browse/HBASE-5726 Project: HBase Issue Type: Bug Reporter: Uma Maheswara Rao G Priority: Critical Attachments: Hbase.log_testExistingZnodeBlocksSplitAndWeRollback testShutdownFixupWhenDaughterHasSplit, Hbase.log_testRSSplitEphemeralsDisappearButDaughtersAreOnlinedAfterShutdownHandling When I ran TestSplitTransactionOnCluster, tests sometimes failed. {quote} java.lang.AssertionError: expected:<1> but was:<0> at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.getAndCheckSingleTableRegion(TestSplitTransactionOnCluster.java:89) at org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testShutdownFixupWhenDaughterHasSplit(TestSplitTransactionOnCluster.java:298) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:62) {quote} Seems like test is flaky, random other cases also fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
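The concern in the HBASE-5726 comment (a count-based wait can end because an unrelated region moved onto the server) can be sketched roughly as follows. This is not the patch in 5726.txt. It is a minimal illustration that assumes region names are plain strings; SplitWait, waitFor, countGrew and daughtersOnline are hypothetical names and not part of HBase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

/** Illustrative only: region names are plain strings, not HBase types. */
final class SplitWait {

    /** Polls until the condition holds or the timeout elapses; returns whether it held. */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Count-based check: the wait ends as soon as the region count grows. */
    static boolean countGrew(List<String> online, int regionCount) {
        return online.size() > regionCount;
    }

    /** Stricter check: the wait ends only when both daughters are online. */
    static boolean daughtersOnline(List<String> online, Set<String> daughters) {
        return online.containsAll(daughters);
    }

    public static void main(String[] args) {
        Set<String> daughters = new HashSet<>(Arrays.asList("daughterA", "daughterB"));
        // An unrelated region moved onto the server; the parent has not split yet.
        List<String> online = Arrays.asList("parent", "movedIn");
        System.out.println(countGrew(online, 1));               // true: loop would exit early
        System.out.println(daughtersOnline(online, daughters)); // false: still waiting
    }
}
```

In a real test the condition passed to waitFor would query the region server rather than a fixed list; the point is only that the exit condition names the daughters instead of relying on the total count.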
[jira] [Updated] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5726: -- Attachment: 5726.txt I ran TestSplitTransactionOnCluster#testShutdownFixupWhenDaughterHasSplit 5 times with the patch - they passed.
[jira] [Updated] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5726: -- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5726: -- Fix Version/s: 0.96.0
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291476#comment-13291476 ] Zhihong Ted Yu commented on HBASE-5924: --- TestRegionServerCoprocessorExceptionWithAbort failed on QA machine. Should investigate. In the client code, don't wait for all the requests to be executed before resubmitting a request in error. -- Key: HBASE-5924 URL: https://issues.apache.org/jira/browse/HBASE-5924 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Attachments: 5924.v11.patch, 5924.v5.patch, 5924.v9.patch The client (in the function HConnectionManager#processBatchCallback) works in two steps: - make the requests - collect the failures and successes and prepare for retry It means that when there is an immediate error (region moved, split, dead server, ...) we still wait for all the initial requests to be executed before resubmitting the failed request. If we have a scenario with all the requests taking 5 seconds we have a final execution time of: 5 (initial requests) + 1 (wait time) + 5 (final request) = 11s. We could improve this by analyzing the results immediately. This would lead us, for the scenario mentioned above, to 6 seconds. So we could have a performance improvement of nearly 50% in many cases, and much more than 50% if the request execution times differ.
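The 11s versus 6s arithmetic in the HBASE-5924 description can be checked with a small model. This is only a sketch under simplifying assumptions: all requests start at time 0, a failing request fails immediately, and every retry succeeds. RetryTimeline, Request, waitForAll and resubmitEarly are hypothetical names and do not come from the HBase client code.

```java
import java.util.Arrays;
import java.util.List;

/** Toy timing model; it does not issue any real requests. */
final class RetryTimeline {

    /** One request: duration of a successful attempt, and whether the first attempt fails at once. */
    static final class Request {
        final double seconds;
        final boolean failsImmediately;

        Request(double seconds, boolean failsImmediately) {
            this.seconds = seconds;
            this.failsImmediately = failsImmediately;
        }
    }

    /** Two-step behaviour: wait for every initial request, pause, then retry the failures. */
    static double waitForAll(List<Request> requests, double pauseSeconds) {
        double initialPhase = 0;
        double retryPhase = 0;
        boolean anyFailure = false;
        for (Request r : requests) {
            if (r.failsImmediately) {
                anyFailure = true;
                retryPhase = Math.max(retryPhase, r.seconds);
            } else {
                initialPhase = Math.max(initialPhase, r.seconds);
            }
        }
        return anyFailure ? initialPhase + pauseSeconds + retryPhase : initialPhase;
    }

    /** Proposed behaviour: each failure is retried after the pause, without waiting for the others. */
    static double resubmitEarly(List<Request> requests, double pauseSeconds) {
        double end = 0;
        for (Request r : requests) {
            double finish = r.failsImmediately ? pauseSeconds + r.seconds : r.seconds;
            end = Math.max(end, finish);
        }
        return end;
    }

    public static void main(String[] args) {
        List<Request> batch = Arrays.asList(
            new Request(5, false), new Request(5, false), new Request(5, true));
        System.out.println(waitForAll(batch, 1));    // 11.0
        System.out.println(resubmitEarly(batch, 1)); // 6.0
    }
}
```

With uneven request times the gap widens: if the successful requests take 20 seconds and the failed one takes 5, the two-step total is 26 seconds while the early resubmit total stays at 20.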
[jira] [Commented] (HBASE-5498) Secure Bulk Load
[ https://issues.apache.org/jira/browse/HBASE-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291474#comment-13291474 ] Zhihong Ted Yu commented on HBASE-5498: --- bq. we should make LoadIncrementalHFiles use the secure api when security is enabled? Sure. Secure Bulk Load Key: HBASE-5498 URL: https://issues.apache.org/jira/browse/HBASE-5498 Project: HBase Issue Type: Improvement Components: mapred, security Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 Attachments: HBASE-5498_draft.patch Design doc: https://cwiki.apache.org/confluence/display/HCATALOG/HBase+Secure+Bulk+Load Short summary: Security as it stands does not cover the bulkLoadHFiles() feature. Users calling this method will bypass ACLs. Loading is also more cumbersome in a secure setting because of HDFS privileges: bulkLoadHFiles() moves the data from the user's directory to the hbase directory, which requires certain write access privileges to be set. Our solution is to create a coprocessor which makes use of AuthManager to verify whether a user has write access to the table. If so, it launches an MR job as the hbase user to do the importing (i.e., rewrite from text to HFiles). One tricky part of this job is impersonating the calling user when reading the input files. We can do this by expecting the user to pass an HDFS delegation token as part of the secureBulkLoad() coprocessor call and extending an InputFormat to make use of that token. The output is written to a temporary directory accessible only by hbase and then bulkLoadHFiles() is called.
[jira] [Commented] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291477#comment-13291477 ] Zhihong Ted Yu commented on HBASE-5726: --- Integrated to trunk. Will resolve after at least 5 trunk builds where this test passes. Thanks for the review, Stack.
[jira] [Updated] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5924: -- Fix Version/s: 0.96.0 Hadoop Flags: Reviewed In the client code, don't wait for all the requests to be executed before resubmitting a request in error. -- Key: HBASE-5924 URL: https://issues.apache.org/jira/browse/HBASE-5924 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Fix For: 0.96.0 Attachments: 5924.v11.patch, 5924.v14.patch, 5924.v5.patch, 5924.v9.patch
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291657#comment-13291657 ] Zhihong Ted Yu commented on HBASE-5924: --- With RSTracker gone, the following flag is no longer checked: {code} -public synchronized void nodeDeleted(String path) { - if (path.equals(rsNode)) { -regionZKNodeWasDeleted = true; {code} Can we keep the check? {code} -assertTrue("RegionServer aborted on coprocessor exception, as expected.", -rsTracker.regionZKNodeWasDeleted); {code} I think this should be kept: {code} -table.close(); {code}
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291811#comment-13291811 ] Zhihong Ted Yu commented on HBASE-5924: --- {code} $ find hbase-server/src/test -name '*.java' -exec grep 'nized void nodeDeleted(Str' {} \; -print public synchronized void nodeDeleted(String path) { hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestMasterCoprocessorExceptionWithAbort.java public synchronized void nodeDeleted(String path) { hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestMasterCoprocessorExceptionWithRemove.java public synchronized void nodeDeleted(String path) { hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithAbort.java {code} The other two checks are for the master znode. W.r.t. table.close(), cleaning up resources is good programming practice.
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13291889#comment-13291889 ] Zhihong Ted Yu commented on HBASE-5924: --- {code} +if (loc == null) + throw new IOException(); {code} Without braces, the throw statement should be on the same line as the if. Please include a brief message for the exception. Some long lines should be wrapped: {code} +final Map<HRegionLocation, MultiAction<R>> actionsByServer = new HashMap<HRegionLocation, MultiAction<R>>(); ... +new Triple<MultiAction<R>, HRegionLocation, Future<MultiResponse>>(e.getValue(), e.getKey(), this.pool.submit(callable)); {code}
[jira] [Commented] (HBASE-6194) add open time for a region and list recently closed regions in a regionserver UI
[ https://issues.apache.org/jira/browse/HBASE-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292187#comment-13292187 ] Zhihong Ted Yu commented on HBASE-6194: --- bq. all the region servers that it is hosting Did you mean regions ? add open time for a region and list recently closed regions in a regionserver UI Key: HBASE-6194 URL: https://issues.apache.org/jira/browse/HBASE-6194 Project: HBase Issue Type: Improvement Reporter: Feifei Ji The region server currently lists all the region servers that it is hosting. It will be useful to report when those regions were opened on this server. It will also be useful to report what and when were the recent regions closed.
[jira] [Updated] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-5726: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Trunk build has succeeded 5 times.
[jira] [Assigned] (HBASE-5726) TestSplitTransactionOnCluster occasionally failing
[ https://issues.apache.org/jira/browse/HBASE-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu reassigned HBASE-5726: - Assignee: Zhihong Ted Yu
[jira] [Commented] (HBASE-6060) Regions's in OPENING state from failed regionservers takes a long time to recover
[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292530#comment-13292530 ] Zhihong Ted Yu commented on HBASE-6060: --- For 6060_suggestion_toassign_rs_wentdown_beforerequest.patch: Can you give the following variable a better name? {code} +Set<HRegionInfo> regionPlans = new ConcurrentSkipListSet<HRegionInfo>(); {code} The set doesn't hold region plans. The following javadoc needs to be adjusted accordingly. {code} + * @return Pair that has all regionplans that pertain to this dead server and a list that has {code} {code} + if ((region.getState() == RegionState.State.OFFLINE) + && (region.getState() == RegionState.State.PENDING_OPEN)) { {code} A region cannot be in both states at the same time. '||' should be used instead of '&&'. {code} +deadRegions = new TreeSet<HRegionInfo>(assignedRegions); {code} Since deadRegions is populated above in a different code block from the following: {code} if (deadRegions.remove(region.getRegion())) { {code} running testSSHWhenSourceRSandDestRSInRegionPlanGoneDown (from v3) would lead to an NPE on deadRegions. After fixing the above, testSSHWhenSourceRSandDestRSInRegionPlanGoneDown still fails. 
Regions's in OPENING state from failed regionservers takes a long time to recover - Key: HBASE-6060 URL: https://issues.apache.org/jira/browse/HBASE-6060 Project: HBase Issue Type: Bug Components: master, regionserver Reporter: Enis Soztutar Assignee: rajeshbabu Fix For: 0.96.0, 0.94.1, 0.92.3 Attachments: 6060-94-v3.patch, 6060-94-v4.patch, 6060-94-v4_1.patch, 6060-94-v4_1.patch, 6060-trunk.patch, 6060-trunk.patch, 6060-trunk_2.patch, 6060-trunk_3.patch, 6060_alternative_suggestion.txt, 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch, 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, HBASE-6060-92.patch, HBASE-6060-94.patch we have seen a pattern in tests, that the regions are stuck in OPENING state for a very long time when the region server who is opening the region fails. My understanding of the process: - master calls rs to open the region. If rs is offline, a new plan is generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in master memory, zk still shows OFFLINE). See HRegionServer.openRegion(), HMaster.assign() - RegionServer, starts opening a region, changes the state in znode. But that znode is not ephemeral. (see ZkAssign) - Rs transitions zk node from OFFLINE to OPENING. See OpenRegionHandler.process() - rs then opens the region, and changes znode from OPENING to OPENED - when rs is killed between OPENING and OPENED states, then zk shows OPENING state, and the master just waits for rs to change the region state, but since rs is down, that wont happen. - There is a AssignmentManager.TimeoutMonitor, which does exactly guard against these kind of conditions. It periodically checks (every 10 sec by default) the regions in transition to see whether they timedout (hbase.master.assignment.timeoutmonitor.timeout). Default timeout is 30 min, which explains what you and I are seeing. - ServerShutdownHandler in Master does not reassign regions in OPENING state, although it handles other states. 
Lowering that threshold from the configuration is one option, but still I think we can do better. Will investigate more. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
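The gap described in the issue can be sketched as a decision function. This is a simplified model and not HBase code: RegionPhase, RecoveryModel and both method names are made up for illustration. Only the OPENING state and the 30 minute default timeout come from the description, and the second method is just one possible direction, not the committed fix.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical model of the states a region passes through while opening.
enum RegionPhase { OFFLINE, PENDING_OPEN, OPENING, OPENED }

class RecoveryModel {
    // Default of hbase.master.assignment.timeoutmonitor.timeout as stated in the issue.
    static final long DEFAULT_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(30);

    // Behaviour reported in the issue: a region left in OPENING is recovered
    // only once the timeout monitor sees that it has timed out.
    static boolean reassignByTimeoutOnly(RegionPhase phase, long elapsedMs, long timeoutMs) {
        return phase != RegionPhase.OPENED && elapsedMs >= timeoutMs;
    }

    // Assumed improvement: also reassign at once when the server that was
    // opening the region is known to be dead.
    static boolean reassignOnServerDeath(RegionPhase phase, boolean openingServerDead,
                                         long elapsedMs, long timeoutMs) {
        if (phase == RegionPhase.OPENING && openingServerDead) {
            return true;
        }
        return reassignByTimeoutOnly(phase, elapsedMs, timeoutMs);
    }
}
```

In this model, a region whose server died one minute ago stays stuck under the first rule but is reassigned under the second, which is the difference between waiting up to 30 minutes and recovering as part of server shutdown handling.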
[jira] [Updated] (HBASE-6012) Handling RegionOpeningState for bulk assign
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-6012:
Summary: Handling RegionOpeningState for bulk assign (was: Handling RegionOpeningState for bulk assign since SSH using)
Handling RegionOpeningState for bulk assign
Key: HBASE-6012 URL: https://issues.apache.org/jira/browse/HBASE-6012 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch, HBASE-6012v4.patch, HBASE-6012v5.patch, HBASE-6012v6.patch
Since HBASE-5914, we use bulk assign for SSH. But in the bulk assign case, if we get ALREADY_OPENED, nothing clears the znode created by bulk assign. Also, when an RS is opening a list of regions and one region is already in transition, it throws RegionAlreadyInTransitionException and stops opening the other regions.
[jira] [Updated] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-6195:
Status: Patch Available (was: Open)
Increment data will lost when the memstore flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Attachments: HBASE-6195-trunk-V2.patch, HBASE-6195-trunk.patch
There are two problems in increment() now.
First: the timestamp (the variable now) in HRegion's Increment() is generated before the rowLock is acquired, so when multiple threads increment the same row, a thread that generated its timestamp earlier may get the lock later. Because increment stores just one version, the result is still right up to this point. When the region is flushing, however, the increment reads the kv from both the snapshot and the memstore, takes whichever has the larger timestamp, and writes the result back to the memstore. If the snapshot's timestamp is larger than the memstore's, the increment gets the old data and increments that, which is wrong.
Second: there is also a risk in increment. Because it writes the memstore first and then the HLog, if the HLog write fails, the client will still read the incremented value.
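The first problem comes down to where the timestamp is taken relative to the row lock. Below is a minimal sketch of the safe ordering, assuming a single in-memory counter per row. RowCounter and runParallel are hypothetical names, and the sketch does not model the snapshot, the memstore or the HLog; it only shows the timestamp being generated after the lock is held.

```java
import java.util.concurrent.locks.ReentrantLock;

class RowCounter {
    private final ReentrantLock rowLock = new ReentrantLock();
    private long value;          // current counter value
    private long lastTimestamp;  // timestamp attached to the most recent write

    long increment(long amount) {
        rowLock.lock();
        try {
            // Taken while holding the lock and never allowed to move backwards,
            // so timestamp order matches the order in which writers got the lock.
            long now = Math.max(System.currentTimeMillis(), lastTimestamp);
            value += amount;
            lastTimestamp = now;
            return value;
        } finally {
            rowLock.unlock();
        }
    }

    // Runs `threads` writers, each adding 1 `perThread` times, and returns the final value.
    static long runParallel(int threads, int perThread) {
        RowCounter counter = new RowCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment(1);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for writers", e);
            }
        }
        return counter.increment(0);
    }
}
```

If the timestamp were read before rowLock.lock(), a writer could carry an older timestamp into the critical section than the write that preceded it, which is the ordering the issue describes.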
[jira] [Commented] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292849#comment-13292849 ]
Zhihong Ted Yu commented on HBASE-6195:
@Xing: Hadoop QA run wasn't triggered. Can you add a unit test showing this problem and present test suite results? Thanks
Increment data will lost when the memstore flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Attachments: HBASE-6195-trunk-V2.patch, HBASE-6195-trunk-V3.patch, HBASE-6195-trunk.patch
[jira] [Commented] (HBASE-6195) Increment data will lost when the memstore flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292862#comment-13292862 ]
Zhihong Ted Yu commented on HBASE-6195:
In patch v3:
{code}
+ long now = EnvironmentEdgeManager.currentTimeMillis();
 Integer lid = getLock(lockid, row, true);
{code}
Variable now isn't actually referenced. Do we need it?
{code}
+ //store the kvs to the tmp memory for write hlog first, then write memory
{code}
The above should read: 'to temporary memstore before writing HLog'
Increment data will lost when the memstore flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Attachments: HBASE-6195-trunk-V2.patch, HBASE-6195-trunk-V3.patch, HBASE-6195-trunk.patch
[jira] [Updated] (HBASE-6195) Increment data will be lost when the memstore is flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-6195:
Assignee: ShiXing
Hadoop Flags: Reviewed
Summary: Increment data will be lost when the memstore is flushed (was: Increment data will lost when the memstore flushed)
Increment data will be lost when the memstore is flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Assignee: ShiXing Attachments: HBASE-6195-trunk-V2.patch, HBASE-6195-trunk-V3.patch, HBASE-6195-trunk.patch
[jira] [Commented] (HBASE-6012) Handling RegionOpeningState for bulk assign
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292868#comment-13292868 ]
Zhihong Ted Yu commented on HBASE-6012:
@Chunhui: Hadoop QA is not functioning. Can you run the whole test suite and post the result?
Handling RegionOpeningState for bulk assign
Key: HBASE-6012 URL: https://issues.apache.org/jira/browse/HBASE-6012 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch, HBASE-6012v4.patch, HBASE-6012v5.patch, HBASE-6012v6.patch
[jira] [Commented] (HBASE-6012) Handling RegionOpeningState for bulk assign
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293270#comment-13293270 ]
Zhihong Ted Yu commented on HBASE-6012:
There was an NPE in TestAssignmentManager#testSSHWhenSplitRegionInProgress. Please fix. Thanks
Handling RegionOpeningState for bulk assign
Key: HBASE-6012 URL: https://issues.apache.org/jira/browse/HBASE-6012 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch, HBASE-6012v4.patch, HBASE-6012v5.patch, HBASE-6012v6.patch
[jira] [Commented] (HBASE-6012) Handling RegionOpeningState for bulk assign
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293356#comment-13293356 ]
Zhihong Ted Yu commented on HBASE-6012:
I ran the two failed tests manually and they passed. Will integrate tomorrow if there is no objection.
Handling RegionOpeningState for bulk assign
Key: HBASE-6012 URL: https://issues.apache.org/jira/browse/HBASE-6012 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch, HBASE-6012v4.patch, HBASE-6012v5.patch, HBASE-6012v6.patch, HBASE-6012v7.patch, HBASE-6012v8.patch
[jira] [Updated] (HBASE-5914) Bulk assign regions in the process of ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-5914:
Resolution: Fixed
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)
Bulk assign regions in the process of ServerShutdownHandler
Key: HBASE-5914 URL: https://issues.apache.org/jira/browse/HBASE-5914 Project: HBase Issue Type: Improvement Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-5914.patch, HBASE-5914v2.patch, HBASE-5914v3.patch
ServerShutdownHandler currently assigns regions one at a time. In a large cluster, where one regionserver typically carries many regions, this is quite slow. What about bulk assigning regions, as is done at cluster startup? In the current logic, if we fail to assign many regions to one destination server, we wait until the timeout; in ServerShutdownHandler, however, we should retry the assignment on another server.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293375#comment-13293375 ]
Zhihong Ted Yu commented on HBASE-5564:
Minor comment:
{code}
+ throw new BadTsvLineException("Invalid timestamp");
{code}
Can the timestamp string be included?
Bulkload is discarding duplicate records
Key: HBASE-5564 URL: https://issues.apache.org/jira/browse/HBASE-5564 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.96.0 Environment: HBase 0.92 Reporter: Laxman Assignee: Laxman Labels: bulkloader Fix For: 0.96.0 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch
Duplicate records are discarded when they exist in the same input file, and more specifically in the same split. Duplicate records are taken into account if they come from different splits.
Version under test: HBase 0.92
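The review request above, to include the timestamp string in the exception, might look like the following sketch. BadTsvLineException here is a local stand-in declared only so the example compiles on its own, and TimestampParser and parseTimestamp are hypothetical names rather than the methods in the patch.

```java
// Local stand-in for the exception type named in the patch.
class BadTsvLineException extends Exception {
    BadTsvLineException(String message) {
        super(message);
    }
}

class TimestampParser {
    // Parses a timestamp column; on failure the message carries the raw text.
    static long parseTimestamp(String raw) throws BadTsvLineException {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // Including the offending text makes the bad input line easy to find.
            throw new BadTsvLineException("Invalid timestamp " + raw);
        }
    }
}
```

A caller that logs the exception message then sees the exact value that failed to parse instead of a generic "Invalid timestamp".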
[jira] [Commented] (HBASE-5699) Run with > 1 WAL in HRegionServer
[ https://issues.apache.org/jira/browse/HBASE-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293517#comment-13293517 ]
Zhihong Ted Yu commented on HBASE-5699:
As I mentioned in HBASE-6055 @ 04/Jun/12 17:47, one of the benefits of this feature is for each HLog file to receive edits for one single table.
Run with > 1 WAL in HRegionServer
Key: HBASE-5699 URL: https://issues.apache.org/jira/browse/HBASE-5699 Project: HBase Issue Type: Improvement Reporter: binlijin Assignee: Li Pi Attachments: PerfHbase.txt
[jira] [Commented] (HBASE-6012) Handling RegionOpeningState for bulk assign
[ https://issues.apache.org/jira/browse/HBASE-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293668#comment-13293668 ]
Zhihong Ted Yu commented on HBASE-6012:
Integrated to trunk. Thanks for the patch, Chunhui. Thanks for the review, Stack and Ram.
Handling RegionOpeningState for bulk assign
Key: HBASE-6012 URL: https://issues.apache.org/jira/browse/HBASE-6012 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6012.patch, HBASE-6012v2.patch, HBASE-6012v3.patch, HBASE-6012v4.patch, HBASE-6012v5.patch, HBASE-6012v6.patch, HBASE-6012v7.patch, HBASE-6012v8.patch
[jira] [Commented] (HBASE-6195) Increment data will be lost when the memstore is flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293708#comment-13293708 ]
Zhihong Ted Yu commented on HBASE-6195:
The new test fails without the fix in the patch:
{code}
Failed tests: testParallelIncrementWithMemStoreFlush(org.apache.hadoop.hbase.regionserver.TestHRegion): expected:<2000> but was:<968>
{code}
Will integrate this afternoon if there is no objection.
Increment data will be lost when the memstore is flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Assignee: ShiXing Attachments: HBASE-6195-trunk-V2.patch, HBASE-6195-trunk-V3.patch, HBASE-6195-trunk-V4.patch, HBASE-6195-trunk-V5.patch, HBASE-6195-trunk-V6.patch, HBASE-6195-trunk.patch
[jira] [Updated] (HBASE-6195) Increment data will be lost when the memstore is flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-6195:
Attachment: 6195-trunk-V7.patch
Modified the test slightly. Made Incrementer class private, removed unused variable.
Increment data will be lost when the memstore is flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Assignee: ShiXing Attachments: 6195-trunk-V7.patch, HBASE-6195-trunk-V2.patch, HBASE-6195-trunk-V3.patch, HBASE-6195-trunk-V4.patch, HBASE-6195-trunk-V5.patch, HBASE-6195-trunk-V6.patch, HBASE-6195-trunk.patch
[jira] [Commented] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
[ https://issues.apache.org/jira/browse/HBASE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293778#comment-13293778 ]
Zhihong Ted Yu commented on HBASE-5924:
I don't have further comments. Thanks
In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
Key: HBASE-5924 URL: https://issues.apache.org/jira/browse/HBASE-5924 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Fix For: 0.96.0 Attachments: 5924.v11.patch, 5924.v14.patch, 5924.v5.patch, 5924.v9.patch
The client (in the function HConnectionManager#processBatchCallback) works in two steps:
- make the requests
- collect the failures and successes and prepare for retry
This means that when there is an immediate error (region moved, split, dead server, ...) we still wait for all the initial requests to be executed before resubmitting the failed request. If we have a scenario with all the requests taking 5 seconds, we have a final execution time of: 5 (initial requests) + 1 (wait time) + 5 (final request) = 11s.
We could improve this by analyzing the results immediately. For the scenario mentioned above, this would bring us to 6 seconds. So we could have a performance improvement of nearly 50% in many cases, and much more than 50% if the request execution times differ.
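The arithmetic in the description can be written out as a small model. RetryTiming and its method names are hypothetical, and the model assumes the failure is detected at a known time and that the retry runs concurrently with the remaining initial requests; it is not the client code itself.

```java
class RetryTiming {
    // Two-step flow: wait for every initial request, pause, then run the retry.
    static long waitForAll(long slowestInitial, long pause, long retry) {
        return slowestInitial + pause + retry;
    }

    // Eager flow: the retry is scheduled as soon as the failure is seen, so it
    // overlaps with the initial requests that are still running.
    static long retryImmediately(long failureSeenAt, long slowestInitial, long pause, long retry) {
        return Math.max(slowestInitial, failureSeenAt + pause + retry);
    }
}
```

With the numbers from the description (5 second requests, 1 second wait, failure seen at time 0), waitForAll(5, 1, 5) gives 11 and retryImmediately(0, 5, 1, 5) gives 6, matching the 11s and 6 seconds quoted above.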
[jira] [Commented] (HBASE-6195) Increment data will be lost when the memstore is flushed
[ https://issues.apache.org/jira/browse/HBASE-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293828#comment-13293828 ]
Zhihong Ted Yu commented on HBASE-6195:
Integrated to trunk. Thanks for the patch, Xing.
Increment data will be lost when the memstore is flushed
Key: HBASE-6195 URL: https://issues.apache.org/jira/browse/HBASE-6195 Project: HBase Issue Type: Bug Components: regionserver Reporter: Xing Shi Assignee: ShiXing Attachments: 6195-trunk-V7.patch, HBASE-6195-trunk-V2.patch, HBASE-6195-trunk-V3.patch, HBASE-6195-trunk-V4.patch, HBASE-6195-trunk-V5.patch, HBASE-6195-trunk-V6.patch, HBASE-6195-trunk.patch
[jira] [Commented] (HBASE-6185) region autoSplit when not reach 'hbase.hregion.max.filesize'
[ https://issues.apache.org/jira/browse/HBASE-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13294004#comment-13294004 ]
Zhihong Ted Yu commented on HBASE-6185:
Please also wrap the long line in the patch. Currently we maintain 100 characters per line.
region autoSplit when not reach 'hbase.hregion.max.filesize'
Key: HBASE-6185 URL: https://issues.apache.org/jira/browse/HBASE-6185 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.94.0 Reporter: nneverwei Attachments: HBASE-6185.patch
When using hbase0.94.0 we met a strange problem. We configured 'hbase.hregion.max.filesize' to 100Gb (the recommended value for effectively turning auto-split off).
{code:xml}
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>
{code}
Then we kept putting data into a table. But when the data size was still far less than 100Gb (about 500~600 of uncompressed data), the table auto-split into 2 regions... I changed the log4j config to DEBUG and saw the logs below:
{code}
2012-06-07 10:30:52,161 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.0m/134221272, currentsize=1.5m/1617744 for region FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8. in 3201ms, sequenceid=176387980, compaction requested=false
2012-06-07 10:30:52,161 DEBUG org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=138657416, sizeToCheck=134217728, regionsWithCommonTable=1
2012-06-07 10:30:52,161 DEBUG org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=138657416, sizeToCheck=134217728, regionsWithCommonTable=1
2012-06-07 10:30:52,240 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8.. compaction_queue=(0:0), split_queue=0
2012-06-07 10:30:52,265 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8.
2012-06-07 10:30:52,265 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: regionserver:60020-0x137c4929efe0001 Creating ephemeral node for 7b229abcd0785408251a579e9bdf49c8 in SPLITTING state
2012-06-07 10:30:52,368 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x137c4929efe0001 Attempting to transition node 7b229abcd0785408251a579e9bdf49c8 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
2012-06-07 10:30:52,382 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x137c4929efe0001 Successfully transitioned node 7b229abcd0785408251a579e9bdf49c8 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
2012-06-07 10:30:52,410 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8.: disabling compactions & flushes
2012-06-07 10:30:52,410 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8. is closing
2012-06-07 10:30:52,411 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8. is closing
{code}
{color:red}IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=138657416, sizeToCheck=134217728{color}
I did not configure a splitPolicy for hbase, so this means *IncreasingToUpperBoundRegionSplitPolicy is the default splitPolicy of 0.94.0*.
After adding
{code:xml}
<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
</property>
{code}
autosplit did not happen again and everything went well.
But the javadoc on ConstantSizeRegionSplitPolicy still says 'This is the default split policy', and so does http://hbase.apache.org/book/regions.arch.html, 9.7.4.1. Custom Split Policies: 'default split policy: ConstantSizeRegionSplitPolicy.'. These may mislead us into thinking that setting hbase.hregion.max.filesize to 100Gb almost shuts off auto-split. You may want to change those docs. What's more, in many scenarios we actually need to control splits manually (as you know, while the table is splitting it is offline, and reads and writes will fail).
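The logged numbers can be reproduced with a small sketch of the two checks. The threshold formula below, min(max file size, flush size * region count squared), is my understanding of how IncreasingToUpperBoundRegionSplitPolicy behaved in 0.94 and should be treated as an assumption; the flush size of 134217728 is inferred from the sizeToCheck value in the log, and SplitCheck with its method names is hypothetical.

```java
class SplitCheck {
    // ConstantSizeRegionSplitPolicy-style check: split only past the configured maximum.
    static boolean constantSizeShouldSplit(long storeSize, long maxFileSize) {
        return storeSize > maxFileSize;
    }

    // Assumed threshold for IncreasingToUpperBoundRegionSplitPolicy:
    // min(maxFileSize, flushSize * regionCount^2).
    static long sizeToCheck(long maxFileSize, long flushSize, int regionCount) {
        if (regionCount == 0) {
            return maxFileSize;
        }
        return Math.min(maxFileSize, flushSize * regionCount * regionCount);
    }

    static boolean increasingShouldSplit(long storeSize, long maxFileSize,
                                         long flushSize, int regionCount) {
        return storeSize > sizeToCheck(maxFileSize, flushSize, regionCount);
    }
}
```

Under that assumption, one region of the table on the server gives a threshold of 134217728, so a store of 138657416 bytes splits even though hbase.hregion.max.filesize is 107374182400, while the constant-size check against the same maximum does not split.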
[jira] [Updated] (HBASE-6134) Improvement for split-worker to speed up distributed-split-log
[ https://issues.apache.org/jira/browse/HBASE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihong Ted Yu updated HBASE-6134:
Attachment: 6134v4.patch
TestSplitLogManager passes locally. Reattaching patch v4.
Improvement for split-worker to speed up distributed-split-log
Key: HBASE-6134 URL: https://issues.apache.org/jira/browse/HBASE-6134 Project: HBase Issue Type: Improvement Components: wal Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.96.0 Attachments: 6134v4.patch, HBASE-6134.patch, HBASE-6134v2.patch, HBASE-6134v3-92.patch, HBASE-6134v3.patch, HBASE-6134v4.patch
First, we compared local-master-splitting with distributed-log-splitting.
Environment: 34 hlog files, 5 regionservers (after killing one, only 4 rs do the splitting work), 400 regions in one hlog file
local-master-split: 60s+
distributed-log-splitting: 165s+
In fact, in our production environment, distributed-log-splitting also took 60s with 30 regionservers for 34 hlog files (the regionservers may have been under high load).
We found that the split-worker took about 20s to split one log file (30ms~50ms per writer.close(); 10ms per writer creation).
I think we could improve this by parallelizing the creation and closing of writers in threads.
The patch changes the logic for distributed-log-splitting to match local-master-splitting and parallelizes the close in threads.
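The idea of closing writers in parallel can be sketched with a plain thread pool. This is an illustration under stated assumptions and not the patch: ParallelCloser and closeAll are hypothetical names, writers are modelled as java.io.Closeable, and a failed close is only counted rather than handled the way the split worker would need to.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCloser {
    // Closes every writer on a fixed-size pool and returns the number of failed closes.
    static int closeAll(List<? extends Closeable> writers, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            for (Closeable writer : writers) {
                results.add(pool.submit(() -> {
                    try {
                        writer.close();
                        return true;
                    } catch (IOException e) {
                        return false;
                    }
                }));
            }
            int failures = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (!result.get()) {
                        failures++;
                    }
                } catch (ExecutionException e) {
                    failures++;
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and report the close as not confirmed.
                    Thread.currentThread().interrupt();
                    failures++;
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 400 writers at 30ms~50ms per close, sequential closes alone would take roughly 12 to 20 seconds, which is in line with the 20s per log file reported above; spreading them over a pool divides that by about the number of threads, as long as the underlying filesystem keeps up.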
[jira] [Updated] (HBASE-6134) Improvement for split-worker to speed up distributed-split-log
[ https://issues.apache.org/jira/browse/HBASE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6134: -- Hadoop Flags: Reviewed
[jira] [Commented] (HBASE-6134) Improvement for split-worker to speed up distributed-split-log
[ https://issues.apache.org/jira/browse/HBASE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294053#comment-13294053 ] Zhihong Ted Yu commented on HBASE-6134: -- TestServerCustomProtocol passes locally. Will integrate later if there is no objection.
[jira] [Commented] (HBASE-5970) Improve the AssignmentManager#updateTimer and speed up handling opened event
[ https://issues.apache.org/jira/browse/HBASE-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294072#comment-13294072 ]

Zhihong Ted Yu commented on HBASE-5970:
--
@Chunhui: You can open a new issue for improving the above code.

Improve the AssignmentManager#updateTimer and speed up handling opened event
--
Key: HBASE-5970
URL: https://issues.apache.org/jira/browse/HBASE-5970
Project: HBase
Issue Type: Improvement
Components: master
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
Fix For: 0.96.0
Attachments: 5970v3.patch, HBASE-5970.patch, HBASE-5970v2.patch, HBASE-5970v3.patch, HBASE-5970v4.patch, HBASE-5970v4.patch

We found that handling the opened event is very slow in an environment with lots of regions. The problem is the slow AssignmentManager#updateTimer.
We tested bulk assigning 10w (i.e. 100k) regions; the whole bulk-assign process took about 1 hour.
2012-05-06 20:31:49,201 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 10 region(s) round-robin across 5 server(s)
2012-05-06 21:26:32,103 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done
I think we could improve AssignmentManager#updateTimer by making a separate thread do this work.
After the improvement, it took only 4.5 mins.
2012-05-07 11:03:36,581 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 10 region(s) across 5 server(s), retainAssignment=true
2012-05-07 11:07:57,073 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done
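The idea of "make a thread do this work" can be sketched in Java as follows. This is only an illustration of the general pattern, not the code in the attached patches: `DeferredTimerUpdater`, `requestUpdate`, and the `slowUpdate` callback are hypothetical names, and servers are identified by plain strings. The event handler only records which server needs its timers refreshed, and a single background thread performs the slow refresh, so repeated requests for the same server collapse into one.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch, not HBase code: moves a slow per-server update off the handler thread. */
final class DeferredTimerUpdater implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> queued = ConcurrentHashMap.newKeySet();
    private final Consumer<String> slowUpdate;

    /** slowUpdate stands in for the expensive timer refresh; it runs only on the updater thread. */
    DeferredTimerUpdater(Consumer<String> slowUpdate) {
        this.slowUpdate = slowUpdate;
    }

    /** Cheap call for event handlers: a server already waiting in the queue is not queued again. */
    void requestUpdate(String serverName) {
        if (queued.add(serverName)) {
            queue.add(serverName);
        }
    }

    /** Drains requests one at a time until the thread is interrupted. */
    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String server = queue.take();
                queued.remove(server); // later requests for this server are accepted again
                slowUpdate.accept(server);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and exit
        }
    }
}
```

In this sketch the handler for an opened event would call `requestUpdate(serverName)` and return immediately, while a daemon thread started with `new Thread(updater).start()` does the refresh in the background.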
[jira] [Commented] (HBASE-6185) region autoSplit when not reach 'hbase.hregion.max.filesize'
[ https://issues.apache.org/jira/browse/HBASE-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294079#comment-13294079 ]

Zhihong Ted Yu commented on HBASE-6185:
--
{code}
+ * This is the default split policy. From 0.94.0 the default split policy change
{code}
The above should read 'This was the default split policy. From 0.94.0 on the default split policy has changed'

region autoSplit when not reach 'hbase.hregion.max.filesize'
--
Key: HBASE-6185
URL: https://issues.apache.org/jira/browse/HBASE-6185
Project: HBase
Issue Type: Bug
Components: documentation
Affects Versions: 0.94.0
Reporter: nneverwei
Fix For: 0.94.1
Attachments: HBASE-6185.patch

When using HBase 0.94.0 we met a strange problem.
We configured 'hbase.hregion.max.filesize' to 100Gb (the recommended value for effectively turning auto-split off).
{code:xml}
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>
{code}
Then we kept putting data into a table.
But when the data size was still far less than 100Gb (about 500~600 of uncompressed data), the table auto-split into 2 regions...
I changed the log4j config to DEBUG and saw the logs below:
{code}
2012-06-07 10:30:52,161 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.0m/134221272, currentsize=1.5m/1617744 for region FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8. in 3201ms, sequenceid=176387980, compaction requested=false
2012-06-07 10:30:52,161 DEBUG org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=138657416, sizeToCheck=134217728, regionsWithCommonTable=1
2012-06-07 10:30:52,161 DEBUG org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=138657416, sizeToCheck=134217728, regionsWithCommonTable=1
2012-06-07 10:30:52,240 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8.. compaction_queue=(0:0), split_queue=0
2012-06-07 10:30:52,265 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8.
2012-06-07 10:30:52,265 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: regionserver:60020-0x137c4929efe0001 Creating ephemeral node for 7b229abcd0785408251a579e9bdf49c8 in SPLITTING state
2012-06-07 10:30:52,368 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x137c4929efe0001 Attempting to transition node 7b229abcd0785408251a579e9bdf49c8 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
2012-06-07 10:30:52,382 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x137c4929efe0001 Successfully transitioned node 7b229abcd0785408251a579e9bdf49c8 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
2012-06-07 10:30:52,410 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8.: disabling compactions & flushes
2012-06-07 10:30:52,410 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8. is closing
2012-06-07 10:30:52,411 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; FileStructIndex,,1339032525500.7b229abcd0785408251a579e9bdf49c8. is closing
{code}
{color:red}IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=138657416, sizeToCheck=134217728{color}
I did not configure a splitPolicy for hbase, so this means *IncreasingToUpperBoundRegionSplitPolicy is the default splitPolicy of 0.94.0*.
After adding
{code:xml}
<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
</property>
{code}
autosplit did not happen again and everything went well.
But the javadoc on ConstantSizeRegionSplitPolicy still says 'This is the default split policy'. So does http://hbase.apache.org/book/regions.arch.html, section 9.7.4.1. Custom Split Policies: 'default split policy: ConstantSizeRegionSplitPolicy.'. These may mislead us into thinking that setting hbase.hregion.max.filesize to 100Gb almost shuts off auto-split.
Please change those docs. What's more, in many scenarios we actually need to control splits manually (as you know, while splitting, the table is offline and reads and writes will fail).
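The log values above (sizeToCheck=134217728 with regionsWithCommonTable=1) are consistent with a threshold that starts at the memstore flush size and grows with the number of regions. The Java sketch below works through that arithmetic. It assumes the 0.94-era rule is min(max file size, flush size × regions²), which is my reading of IncreasingToUpperBoundRegionSplitPolicy and should be checked against the source for the version in use. The class `SplitSizeSketch` and its method are hypothetical and are not HBase API.

```java
/** Illustrative only: class and method names are hypothetical, not HBase API. */
final class SplitSizeSketch {
    private SplitSizeSketch() {}

    /**
     * Assumed 0.94-era rule: the split threshold grows with the square of the
     * number of regions of the same table on this server, capped at the max file size.
     */
    static long sizeToCheck(long maxFileSize, long flushSize, int regionsWithCommonTable) {
        if (regionsWithCommonTable == 0) {
            return maxFileSize;
        }
        long grown = flushSize * regionsWithCommonTable * regionsWithCommonTable;
        return Math.min(maxFileSize, grown);
    }

    public static void main(String[] args) {
        long maxFileSize = 107374182400L; // hbase.hregion.max.filesize from the report (100Gb)
        long flushSize = 134217728L;      // 128m flush size, matching sizeToCheck in the log
        for (int regions : new int[] {1, 2, 28, 29}) {
            System.out.println(regions + " region(s): sizeToCheck="
                    + sizeToCheck(maxFileSize, flushSize, regions));
        }
    }
}
```

Under this assumption a table with a single region splits once a store exceeds 134217728 bytes, whatever hbase.hregion.max.filesize is set to, and the 100Gb cap only takes over once 29 regions of the table sit on the same regionserver. That matches the reporter's observation that switching to ConstantSizeRegionSplitPolicy stopped the early split.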
[jira] [Updated] (HBASE-6185) Update javadoc for ConstantSizeRegionSplitPolicy class
[ https://issues.apache.org/jira/browse/HBASE-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6185: -- Hadoop Flags: Reviewed Summary: Update javadoc for ConstantSizeRegionSplitPolicy class (was: region autoSplit when not reach 'hbase.hregion.max.filesize')