[jira] [Commented] (HBASE-5074) support checksums in HBase block cache
[ https://issues.apache.org/jira/browse/HBASE-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173012#comment-13173012 ] Todd Lipcon commented on HBASE-5074:

bq. One minor disadvantage of this approach is that checksums would be computed twice, once by the hbase regionserver and once by the hdfs client. How bad is this cpu overhead?

You mean on write? The native CRC32C implementation in HDFS trunk right now can do somewhere around 6GB/sec - I clocked it at about 16% overhead compared to the non-checksummed path a while ago. So I think overhead is fairly minimal.

bq. I am proposing that HBase disk format V3 have a 4 byte checksum for every hbase block

A 4 byte checksum for 64KB+ of data seems pretty low. IMO we should continue to do chunked checksums - maybe a CRC32 for every 1KB in the block. This allows people to use larger block sizes without compromising checksum effectiveness. The reason to choose chunked CRC32 over a wider hash is that CRC32 has a very efficient hardware implementation in SSE4.2. Plus, we can share all the JNI code already developed for Hadoop to calculate and verify this style of checksum :)

support checksums in HBase block cache
---
Key: HBASE-5074
URL: https://issues.apache.org/jira/browse/HBASE-5074
Project: HBase
Issue Type: Improvement
Components: regionserver
Reporter: dhruba borthakur
Assignee: dhruba borthakur

The current implementation of HDFS stores the data in one block file and the metadata (checksum) in another block file. This means that every read into the HBase block cache actually consumes two disk iops, one to the data file and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage hardware offers.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
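A minimal sketch of the chunked scheme Todd describes, assuming the JDK's java.util.zip.CRC32 as a stand-in for the hardware-accelerated CRC32C; class and method names are illustrative, not HBase code:

{code}
import java.util.zip.CRC32;

/** Sketch: one 4-byte CRC per 1KB chunk of a block, so large blocks
 *  keep fine-grained corruption detection. */
public class ChunkedCrcSketch {
  static final int CHUNK_SIZE = 1024; // 1KB per the comment above

  /** One CRC32 value per chunk of the block. */
  static int[] checksum(byte[] block) {
    int nChunks = (block.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    int[] crcs = new int[nChunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < nChunks; i++) {
      int off = i * CHUNK_SIZE;
      int len = Math.min(CHUNK_SIZE, block.length - off);
      crc.reset();
      crc.update(block, off, len);
      crcs[i] = (int) crc.getValue(); // CRC32 fits in 4 bytes
    }
    return crcs;
  }

  /** Recompute and compare against stored per-chunk CRCs. */
  static boolean verify(byte[] block, int[] stored) {
    int[] fresh = checksum(block);
    if (fresh.length != stored.length) return false;
    for (int i = 0; i < fresh.length; i++) {
      if (fresh[i] != stored[i]) return false; // chunk i is corrupt
    }
    return true;
  }
}
{code}

With 64KB blocks this stores 64 CRCs (256 bytes) per block instead of a single 4-byte value, which is exactly the effectiveness-versus-overhead trade-off the comment describes.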
[jira] [Commented] (HBASE-5066) Upgrade to zk 3.4.1
[ https://issues.apache.org/jira/browse/HBASE-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173022#comment-13173022 ] Hudson commented on HBASE-5066: --- Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/]) HBASE-5066 Upgrade to zk 3.4.1 stack : Files : * /hbase/trunk/pom.xml Upgrade to zk 3.4.1 --- Key: HBASE-5066 URL: https://issues.apache.org/jira/browse/HBASE-5066 Project: HBase Issue Type: Task Reporter: stack Assignee: stack Fix For: 0.92.0 Attachments: 5066.txt Currently we are shipping 0.92 with 3.4.1rc2 which is what became the release but change the pom to get the release; it looks better. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5062) Missing logons if security is enabled
[ https://issues.apache.org/jira/browse/HBASE-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173026#comment-13173026 ] Hudson commented on HBASE-5062: --- Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/]) HBASE-5062 Missing logons if security is enabled stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/rest/Main.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/Strings.java Missing logons if security is enabled - Key: HBASE-5062 URL: https://issues.apache.org/jira/browse/HBASE-5062 Project: HBase Issue Type: Bug Components: rest, security, thrift Affects Versions: 0.92.0 Reporter: Andrew Purtell Assignee: Andrew Purtell Fix For: 0.92.0 Attachments: HBASE-5062-v2.patch, HBASE-5062.patch Somehow the attached changes are missing from the security integration. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5029) TestDistributedLogSplitting fails on occasion
[ https://issues.apache.org/jira/browse/HBASE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173023#comment-13173023 ] Hudson commented on HBASE-5029: --- Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/]) HBASE-5029 TestDistributedLogSplitting fails on occasion; Added catch of NPE and reenabled ignored test HBASE-5029 TestDistributedLogSplitting fails on occasion; disabling failing test -- redo -- forgot to import @Ignore HBASE-5029 TestDistributedLogSplitting fails on occasion; disabling failing test stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogWriter.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java stack : Files : * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java stack : Files : * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java TestDistributedLogSplitting fails on occasion - Key: HBASE-5029 URL: https://issues.apache.org/jira/browse/HBASE-5029 Project: HBase Issue Type: Bug Reporter: stack Assignee: Prakash Khemani Priority: Critical Attachments: 0001-HBASE-5029-jira-TestDistributedLogSplitting-fails-on.patch, 5029-addingignore.txt, 5029-catch-dfsclient-npe-v2.txt, 5029-catch-dfsclient-npe.txt, HBASE-5029.D891.1.patch, HBASE-5029.D891.2.patch This is how it usually fails: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/lastCompletedBuild/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testWorkerAbort/ Assigning mighty Prakash since he offered to take a looksee. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5051) HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call
[ https://issues.apache.org/jira/browse/HBASE-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173024#comment-13173024 ] Hudson commented on HBASE-5051: --- Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/]) HBASE-5051 HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/constraint/TestConstraint.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportExport.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFilesSplitRecovery.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMaster.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRollAbort.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/rest/TestScannersWithFilters.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/rest/TestTableResource.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRebuildTestCore.java HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call -- Key: HBASE-5051 URL: https://issues.apache.org/jira/browse/HBASE-5051 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.94.0 Attachments: 5051.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch As it's a new instance, it should be closed. As the function name seems to imply that it's an instance managed by HBaseTestingUtility, most of the users don't close it = leak -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
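The shape of the fix is easy to see in isolation; a minimal sketch, with hypothetical names that merely mirror HBaseTestingUtility, of handing out one cached admin that the utility itself closes (relying on the HBaseAdmin close() the report says callers should be using):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

/** Illustrative only, not the committed patch. */
class TestingUtilitySketch {
  private final Configuration conf;
  private HBaseAdmin hbaseAdmin; // lazily created, owned here

  TestingUtilitySketch(Configuration conf) { this.conf = conf; }

  synchronized HBaseAdmin getHBaseAdmin() throws IOException {
    if (hbaseAdmin == null) {
      hbaseAdmin = new HBaseAdmin(conf); // one instance for all callers
    }
    return hbaseAdmin; // callers must not close it; the utility does
  }

  synchronized void shutdown() throws IOException {
    if (hbaseAdmin != null) {
      hbaseAdmin.close(); // single close point: no per-call leak
      hbaseAdmin = null;
    }
  }
}
{code}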
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173025#comment-13173025 ] Hudson commented on HBASE-5063:
---
Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary goes down
stack : Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java

RegionServers fail to report to backup HMaster after primary goes down.
---
Key: HBASE-5063
URL: https://issues.apache.org/jira/browse/HBASE-5063
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
Priority: Critical
Fix For: 0.92.0
Attachments: HBASE-5063.patch, hbase-5063.v2.0.92.patch, hbase-5063.v2.trunk.patch

# Set up a cluster with two HMasters.
# Observe that HM1 is up and that all RS's are in the RegionServer list on the web page.
# Kill (not even -9) the active HMaster.
# Wait for ZK to time out (default 3 minutes).
# Observe that HM2 is now active. Tables may show up but RegionServers never report on the web page. Existing connections are fine. New connections cannot find regionservers.

Note:
* If we put a new HM1 in the same place and kill HM2, the cluster functions normally again after recovery. This seems to indicate that regionservers are stuck trying to talk to the old HM1.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4935) hbase 0.92.0 doesn't work going against 0.20.205.0, its packaged hadoop
[ https://issues.apache.org/jira/browse/HBASE-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173028#comment-13173028 ] Hudson commented on HBASE-4935: --- Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/]) HBASE-4935 hbase 0.92.0 doesn't work going against 0.20.205.0, its packaged hadoop stack : Files : * /hbase/trunk/CHANGES.txt * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java hbase 0.92.0 doesn't work going against 0.20.205.0, its packaged hadoop --- Key: HBASE-4935 URL: https://issues.apache.org/jira/browse/HBASE-4935 Project: HBase Issue Type: Bug Reporter: stack Assignee: stack Fix For: 0.92.0 Attachments: 4935-reverse.txt, 4935-v3-092.txt, 4935-v3.txt, 4935-v3.txt, 4935.txt See this Mikhail thread up on the list: http://search-hadoop.com/m/WMUZR24EAJ1/%2522SequenceFileLogReader+uses+a+reflection+hack+resulting+in+runtime+failures%2522subj=Re+SequenceFileLogReader+uses+a+reflection+hack+resulting+in+runtime+failures Dig into it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5068) RC1 can not build its hadoop-0.23 profile
[ https://issues.apache.org/jira/browse/HBASE-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173030#comment-13173030 ] Hudson commented on HBASE-5068: --- Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/]) HBASE-5068 RC1 can not build its hadoop-0.23 profile stack : Files : * /hbase/trunk/pom.xml RC1 can not build its hadoop-0.23 profile - Key: HBASE-5068 URL: https://issues.apache.org/jira/browse/HBASE-5068 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.92.0 Reporter: Roman Shaposhnik Assignee: Roman Shaposhnik Fix For: 0.92.0, 0.94.0 Attachments: HBASE-5068.patch.txt The hadoop .23 version needs to be bumped to 0.23.1-SNAPSHOT -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173027#comment-13173027 ] Hudson commented on HBASE-5060:
---
Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/])
HBASE-5060 HBase client is blocked forever (Jinchao)
tedyu : Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java

HBase client is blocked forever
---
Key: HBASE-5060
URL: https://issues.apache.org/jira/browse/HBASE-5060
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Critical
Fix For: 0.92.0, 0.90.6
Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch

The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we used an invalid CatalogTracker in tableExists.

Blocked stack:

WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
    - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
    at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
    at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
    at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source)
    at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
    - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper)
    at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
    at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)

In ZooKeeperNodeTracker, we don't throw the KeeperException to a higher level, so at the CatalogTracker level we think ZooKeeperNodeTracker started successfully and continue processing.

[WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
    at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
    at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
    at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source)
    at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
    at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
    at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)

[WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
    at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at
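The root cause called out above (a swallowed KeeperException leaving the tracker silently unstarted) reduces to a small toy example; this is an illustrative analogue, not the committed CatalogTracker fix:

{code}
import org.apache.zookeeper.KeeperException;

/** Toy analogue of ZooKeeperNodeTracker's start() problem. */
class NodeTrackerSketch {
  private byte[] data; // znode payload, set once the watch succeeds

  // Stand-in for ZKUtil.getDataAndWatch(); fails on connection loss.
  private byte[] getDataAndWatch() throws KeeperException {
    throw KeeperException.create(KeeperException.Code.CONNECTIONLOSS);
  }

  /** Buggy shape: the error is logged and dropped, so the caller
   *  believes start succeeded and later blocks forever in waitForMeta(). */
  void startSwallowing() {
    try {
      data = getDataAndWatch();
    } catch (KeeperException e) {
      // swallowed: the tracker is unusable but looks started
    }
  }

  /** Fixed shape: propagate, so HBaseAdmin.getCatalogTracker()
   *  fails fast instead of handing out a dead tracker. */
  void startPropagating() throws KeeperException {
    data = getDataAndWatch();
  }
}
{code}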
[jira] [Commented] (HBASE-5058) Allow HBaseAdmin to use an existing connection
[ https://issues.apache.org/jira/browse/HBASE-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173029#comment-13173029 ] Hudson commented on HBASE-5058:
---
Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/])
HBASE-5058 Allow HBaseAdmin to use an existing connection (Lars H)
larsh : Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java

Allow HBaseAdmin to use an existing connection
---
Key: HBASE-5058
URL: https://issues.apache.org/jira/browse/HBASE-5058
Project: HBase
Issue Type: Sub-task
Components: client
Affects Versions: 0.94.0
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Minor
Fix For: 0.94.0
Attachments: 5058-v2.txt, 5058-v3.txt, 5058-v3.txt, 5058.txt

What HBASE-4805 does for HTables, this should do for HBaseAdmin. Along with this the shared error handling and retrying between HBaseAdmin and HConnectionManager can also be improved. I'll attach a first pass patch soon.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173063#comment-13173063 ] Hudson commented on HBASE-5063:
---
Integrated in HBase-TRUNK #2562 (See [https://builds.apache.org/job/HBase-TRUNK/2562/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary goes down
stack : Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java

RegionServers fail to report to backup HMaster after primary goes down.
---
Key: HBASE-5063
URL: https://issues.apache.org/jira/browse/HBASE-5063
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
Priority: Critical
Fix For: 0.92.0
Attachments: HBASE-5063.patch, hbase-5063.v2.0.92.patch, hbase-5063.v2.trunk.patch

# Set up a cluster with two HMasters.
# Observe that HM1 is up and that all RS's are in the RegionServer list on the web page.
# Kill (not even -9) the active HMaster.
# Wait for ZK to time out (default 3 minutes).
# Observe that HM2 is now active. Tables may show up but RegionServers never report on the web page. Existing connections are fine. New connections cannot find regionservers.

Note:
* If we put a new HM1 in the same place and kill HM2, the cluster functions normally again after recovery. This seems to indicate that regionservers are stuck trying to talk to the old HM1.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Open (was: Patch Available) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Attachment: 5064.v6.patch use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Patch Available (was: Open) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5067) HMaster uses wrong name for address (in stand-alone mode)
[ https://issues.apache.org/jira/browse/HBASE-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173079#comment-13173079 ] Eran Hirsch commented on HBASE-5067:
---
To the best of my understanding, the problem is fixed in trunk, but only to some extent. It seems like the flow would work correctly, but it relies on the underlying VM implementation and assumes certain things which are not strictly assumable. I'll explain:

1. The hostname is computed based on the reverse DNS, as before.
2. An InetSocketAddress is built from this hostname and stored locally as 'initialIsa'.
3. The RPC server is now created using, among others, 'initialIsa.getHostName()'.
4. The address the RPC server bound to is stored as the HMaster field 'isa'.
5. The server name is initialized with the 'isa' field's hostname.

Why is this problematic? Because it assumes things about the socket implementation which are not strictly enforced: we first call the 'bind' method of a ServerSocket object with an InetSocketAddress instance, and later call ServerSocket's 'getLocalSocketAddress' to get this address instance back. There is no way to know if the same object is returned, or a new object is built based on the IP, or whatever other way the implementation chooses. Specifically in our case, there is no guarantee it would still hold the 'hostname' field we gave it, with our fully qualified DNS name.

To conclude, I think there is a semantic problem with the way the HMaster is initialized in its c'tor:
1. When creating the rpcServer, we should call the method with 'initialIsa.getAddress().getHostAddress()' (instead of 'initialIsa.getHostName()'). This would also be consistent with the comment written next to this parameter, which says we are sending an IP (because now we are sending a DNS name).
2. When setting the 'serverName' field, we need to use the local field 'hostname' computed earlier (instead of 'this.isa.getHostName()').

Notes:
1. The same problem applies to HRegionServer, which uses almost the same initialization code in its c'tor.
2. I am not an HBase developer, so I don't really know how to add these changes myself.

HMaster uses wrong name for address (in stand-alone mode)
---
Key: HBASE-5067
URL: https://issues.apache.org/jira/browse/HBASE-5067
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.4
Reporter: Eran Hirsch

In STANDALONE mode: when setting the configuration option hbase.master.dns.interface (and optionally hbase.master.dns.nameserver) to non-default values, it is EXPECTED that the master node would report its fully qualified DNS name when registering in ZooKeeper, BUT INSTEAD the machine's bare hostname is taken. For example, my machine's hostname is machine1 but its name on the network is machine1.our-dev-network.my-corp.com, so to find this machine's IP anywhere on the network I would need to query for the whole name (because trying to find machine1 is ambiguous on a network). Why is this a bug? Because when trying to connect to this stand-alone HBase installation from outside the machine it is running on, querying ZK for /hbase/master yields only the machine1 part, and we then fail with an unresolvable address for the master (which later even gives a null pointer because of a missing null check).

This is the stack trace when calling HTable's c'tor:

java.lang.IllegalArgumentException: hostname can't be null
    at java.net.InetSocketAddress.<init>(InetSocketAddress.java:139) ~[na:1.7.0_02]
    at org.apache.hadoop.hbase.HServerAddress.getResolvedAddress(HServerAddress.java:108) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:64) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.dataToHServerAddress(RootRegionTracker.java:82) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.waitRootRegionLocation(RootRegionTracker.java:73) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:579) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:559) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:688) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:590)
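Eran's five steps can be demonstrated with nothing but the JDK. This standalone snippet (not HBase code) prints the name read back from a bound socket next to the originally resolved name, plus the raw host address his first suggested change would pass instead:

{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Run on a machine whose short hostname differs from its canonical
 *  DNS name to see the values diverge. */
public class HostnameDemo {
  public static void main(String[] args) throws Exception {
    // 1. "hostname" as HMaster computes it (reverse DNS of the local addr)
    InetAddress local = InetAddress.getLocalHost();
    String hostname = local.getCanonicalHostName(); // may be the FQDN

    // 2-4. build the address, bind, and read the bound address back
    InetSocketAddress initialIsa = new InetSocketAddress(hostname, 0);
    ServerSocket ss = new ServerSocket();
    ss.bind(initialIsa);
    InetSocketAddress isa = (InetSocketAddress) ss.getLocalSocketAddress();

    // 5. the candidates a server name could be built from
    System.out.println("computed hostname : " + hostname);
    System.out.println("isa.getHostName() : " + isa.getHostName());
    System.out.println("host address      : "
        + initialIsa.getAddress().getHostAddress()); // suggested fix #1
    ss.close();
  }
}
{code}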
[jira] [Updated] (HBASE-5073) Registered listeners not getting removed leading to memory leak in HBaseAdmin
[ https://issues.apache.org/jira/browse/HBASE-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-5073: -- Attachment: HBASE-5073.patch For branch patch Registered listeners not getting removed leading to memory leak in HBaseAdmin - Key: HBASE-5073 URL: https://issues.apache.org/jira/browse/HBASE-5073 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.90.5 Attachments: HBASE-5073.patch HBaseAdmin apis like tableExists(), flush, split, closeRegion uses catalog tracker. Every time Root node tracker and meta node tracker are started and a listener is registered. But after the operations are performed the listeners are not getting removed. Hence if the admin apis are consistently used then it may lead to memory leak. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
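The leak pattern is easy to isolate. A hypothetical sketch (illustrative names, not the attached patch) of why every tracker start() needs a matching unregister on the shared watcher:

{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class WatcherSketch {
  private final List<Object> listeners = new CopyOnWriteArrayList<Object>();
  void register(Object l)   { listeners.add(l); }
  void unregister(Object l) { listeners.remove(l); }
  int count()               { return listeners.size(); }
}

class TrackerSketch {
  private final WatcherSketch watcher;
  TrackerSketch(WatcherSketch w) { watcher = w; }

  void start() { watcher.register(this); }

  /** The missing piece in the report: without this, every HBaseAdmin
   *  call that starts a CatalogTracker leaves a listener behind. */
  void stop()  { watcher.unregister(this); }
}
{code}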
[jira] [Commented] (HBASE-5066) Upgrade to zk 3.4.1
[ https://issues.apache.org/jira/browse/HBASE-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173131#comment-13173131 ] Hudson commented on HBASE-5066: --- Integrated in HBase-0.92-security #45 (See [https://builds.apache.org/job/HBase-0.92-security/45/]) HBASE-5066 Upgrade to zk 3.4.1 HBASE-5066 Upgrade to zk 3.4.1 stack : Files : * /hbase/branches/0.92/pom.xml stack : Files : * /hbase/branches/0.92/CHANGES.txt Upgrade to zk 3.4.1 --- Key: HBASE-5066 URL: https://issues.apache.org/jira/browse/HBASE-5066 Project: HBase Issue Type: Task Reporter: stack Assignee: stack Fix For: 0.92.0 Attachments: 5066.txt Currently we are shipping 0.92 with 3.4.1rc2 which is what became the release but change the pom to get the release; it looks better. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5029) TestDistributedLogSplitting fails on occasion
[ https://issues.apache.org/jira/browse/HBASE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173132#comment-13173132 ] Hudson commented on HBASE-5029:
---
Integrated in HBase-0.92-security #45 (See [https://builds.apache.org/job/HBase-0.92-security/45/])
HBASE-5029 TestDistributedLogSplitting fails on occasion; Added catch of NPE and reenabled ignored test
HBASE-5029 TestDistributedLogSplitting fails on occasion; disabling failing test -- redo
HBASE-5029 TestDistributedLogSplitting fails on occasion; disabling failing test -- undoing an overcommit; patch -p0 -R x.txt
HBASE-5029 TestDistributedLogSplitting fails on occasion; disabling failing test

stack : Files :
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogWriter.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java

stack : Files :
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java

stack : Files :
* /hbase/branches/0.92/pom.xml
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java

stack : Files :
* /hbase/branches/0.92/pom.xml
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java

TestDistributedLogSplitting fails on occasion
---
Key: HBASE-5029
URL: https://issues.apache.org/jira/browse/HBASE-5029
Project: HBase
Issue Type: Bug
Reporter: stack
Assignee: Prakash Khemani
Priority: Critical
Attachments: 0001-HBASE-5029-jira-TestDistributedLogSplitting-fails-on.patch, 5029-addingignore.txt, 5029-catch-dfsclient-npe-v2.txt, 5029-catch-dfsclient-npe.txt, HBASE-5029.D891.1.patch, HBASE-5029.D891.2.patch

This is how it usually fails: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/lastCompletedBuild/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testWorkerAbort/
Assigning mighty Prakash since he offered to take a looksee.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173134#comment-13173134 ] Hudson commented on HBASE-5060:
---
Integrated in HBase-0.92-security #45 (See [https://builds.apache.org/job/HBase-0.92-security/45/])
HBASE-5060 HBase client is blocked forever (Jinchao)
tedyu : Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java

HBase client is blocked forever
---
Key: HBASE-5060
URL: https://issues.apache.org/jira/browse/HBASE-5060
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Critical
Fix For: 0.92.0, 0.90.6
Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch

The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we used an invalid CatalogTracker in tableExists.

Blocked stack:

WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
    - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
    at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
    at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
    at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source)
    at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
    - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper)
    at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
    at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)

In ZooKeeperNodeTracker, we don't throw the KeeperException to a higher level, so at the CatalogTracker level we think ZooKeeperNodeTracker started successfully and continue processing.

[WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
    at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
    at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
    at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source)
    at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
    at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
    at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)

[WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
    at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173157#comment-13173157 ] Hadoop QA commented on HBASE-5064: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508062/5064.v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -152 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestInstantSchemaChange org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/554//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/554//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/554//console This message is automatically generated. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
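For context, "surefire tests parallelization" is driven from the pom. The fragment below is illustrative only; the exact settings HBase settled on are in the attached 5064 patches and are not reproduced here:

{code}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- run test classes concurrently inside one forked JVM -->
    <parallel>classes</parallel>
    <threadCount>4</threadCount>
    <!-- fork a single JVM rather than one per test class -->
    <forkMode>once</forkMode>
  </configuration>
</plugin>
{code}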
[jira] [Commented] (HBASE-5065) wrong IllegalArgumentException thrown when creating an 'HServerAddress' with an un-reachable hostname
[ https://issues.apache.org/jira/browse/HBASE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173219#comment-13173219 ] Eran Hirsch commented on HBASE-5065:
---
I am not an HBase developer; I don't know how to provide a patch. Anyhow, I checked the trunk and this class has been deprecated altogether, so there is no need to fix this anymore (I guess...?)

wrong IllegalArgumentException thrown when creating an 'HServerAddress' with an un-reachable hostname
---
Key: HBASE-5065
URL: https://issues.apache.org/jira/browse/HBASE-5065
Project: HBase
Issue Type: Bug
Components: util
Affects Versions: 0.90.4
Reporter: Eran Hirsch
Priority: Trivial

When trying to build an 'HServerAddress' object with an unresolvable hostname, e.g. new HServerAddress("www.IAMUNREACHABLE.com:80"), a call to 'getResolvedAddress' causes the 'InetSocketAddress' c'tor to throw an IllegalArgumentException because it is called with a null 'hostname' parameter. This happens because there is no null check after the static 'getBindAddressInternal' method returns a null value when the hostname is unresolved. This is a trivial bug because HServerAddress is expected to throw this kind of exception when this error occurs, but it is thrown for the wrong reason. The method 'checkBindAddressCanBeResolved' should be the one throwing the exception (and give a slightly different reason). For this reason the method call itself becomes redundant, as it will always succeed in the current flow, because the case it checks is already checked by the preceding getResolvedAddress call.

In short: an IllegalArgumentException is thrown with reason "hostname can't be null" from the InetSocketAddress c'tor INSTEAD OF an IllegalArgumentException with reason "Could not resolve the DNS name of [BADHOSTNAME]:[PORT]" from HServerAddress's checkBindAddressCanBeResolved method.

Stack trace:

java.lang.IllegalArgumentException: hostname can't be null
    at java.net.InetSocketAddress.<init>(InetSocketAddress.java:139) ~[na:1.7.0_02]
    at org.apache.hadoop.hbase.HServerAddress.getResolvedAddress(HServerAddress.java:108) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:64) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.dataToHServerAddress(RootRegionTracker.java:82) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.waitRootRegionLocation(RootRegionTracker.java:73) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:579) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:559) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:688) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:590) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:559) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:688) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:594) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:559) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:173) ~[hbase-0.90.4.jar:0.90.4]
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:147) ~[hbase-0.90.4.jar:0.90.4]

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
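The reordering Eran describes is easy to show; a simplified stand-in for HServerAddress (not a patch against it), resolving first so the InetSocketAddress constructor never sees a null hostname:

{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

class ResolvedAddressSketch {
  static InetSocketAddress resolve(String host, int port) {
    InetAddress addr;
    try {
      addr = InetAddress.getByName(host); // DNS lookup
    } catch (UnknownHostException e) {
      // The exception the reporter expected, thrown for the right reason:
      throw new IllegalArgumentException(
          "Could not resolve the DNS name of " + host + ":" + port, e);
    }
    return new InetSocketAddress(addr, port); // hostname never null here
  }
}
{code}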
[jira] [Commented] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further
[ https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173236#comment-13173236 ] Zhihong Yu commented on HBASE-5009:
---
I think we should use threadPool.awaitTermination() where a timeout can be specified so that we don't wait indefinitely.

Failure of creating split dir if it already exists prevents splits from happening further
---
Key: HBASE-5009
URL: https://issues.apache.org/jira/browse/HBASE-5009
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.6
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Attachments: HBASE-5009.patch, HBASE-5009_Branch90.patch

The scenario:
- The split of a region takes a long time.
- The deletion of the splitdir fails due to HDFS problems.
- Subsequent splits also fail after that.

{code}
private static void createSplitDir(final FileSystem fs, final Path splitdir)
    throws IOException {
  if (fs.exists(splitdir)) throw new IOException("Splitdir already exits? " + splitdir);
  if (!fs.mkdirs(splitdir)) throw new IOException("Failed create of " + splitdir);
}
{code}

Correct me if I am wrong. If it is an issue, can we change the behaviour of throwing the exception? Pls suggest.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
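Zhihong's suggestion in code form; a minimal sketch, with an illustrative timeout value:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

class BoundedShutdown {
  static void shutdownBounded(ExecutorService threadPool)
      throws InterruptedException {
    threadPool.shutdown(); // stop accepting new tasks
    if (!threadPool.awaitTermination(30, TimeUnit.SECONDS)) {
      threadPool.shutdownNow(); // give up rather than wait indefinitely
    }
  }
}
{code}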
[jira] [Created] (HBASE-5076) HBase shell hangs when creating some 'illegal' tables.
HBase shell hangs when creating some 'illegal' tables. -- Key: HBASE-5076 URL: https://issues.apache.org/jira/browse/HBASE-5076 Project: HBase Issue Type: Bug Components: shell Affects Versions: 0.92.0 Reporter: Jonathan Hsieh Priority: Minor In hbase shell. These commands hang: {code} create 'hbase.version','foo' create 'splitlog','foo' {code} Interestingly {code} create 'hbase.id','foo' create existingtablename, 'foo' create '.META.','foo' create '-ROOT-','foo' {code} are properly rejected. We should probably either rename to make the files illegal table names (hbase.version to .hbase.version and splitlog to .splitlog) or we could add more special cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
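The "add more special cases" option from the report, sketched in Java; the reserved list and class name are illustrative, not the actual shell validation:

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Reject table names that collide with files HBase keeps in its root
 *  directory (.META. and -ROOT- are already rejected by the existing
 *  checks, per the report). The alternative suggested above is renaming
 *  the files (hbase.version -> .hbase.version, splitlog -> .splitlog)
 *  so the normal table-name rules reject them. */
class TableNameCheck {
  private static final Set<String> RESERVED = new HashSet<String>(
      Arrays.asList("hbase.version", "hbase.id", "splitlog"));

  static void checkNotReserved(String name) {
    if (RESERVED.contains(name)) {
      throw new IllegalArgumentException(
          "Table name collides with an internal HBase file: " + name);
    }
  }
}
{code}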
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173318#comment-13173318 ] Zhihong Yu commented on HBASE-5064: --- Apart from the 5 failed tests, TestReplication hung. use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5066) Upgrade to zk 3.4.1
[ https://issues.apache.org/jira/browse/HBASE-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173323#comment-13173323 ] stack commented on HBASE-5066: -- The tough part IIRC, was that we use 3.4.x APIs because the 3.3.x have been removed w/ no means of work around. Andrew? Upgrade to zk 3.4.1 --- Key: HBASE-5066 URL: https://issues.apache.org/jira/browse/HBASE-5066 Project: HBase Issue Type: Task Reporter: stack Assignee: Andrew Purtell Fix For: 0.92.0 Attachments: 5066.txt Currently we are shipping 0.92 with 3.4.1rc2 which is what became the release but change the pom to get the release; it looks better. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5073) Registered listeners not getting removed leading to memory leak in HBaseAdmin
[ https://issues.apache.org/jira/browse/HBASE-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173330#comment-13173330 ] Zhihong Yu commented on HBASE-5073: --- +1 on patch, if tests pass. Registered listeners not getting removed leading to memory leak in HBaseAdmin - Key: HBASE-5073 URL: https://issues.apache.org/jira/browse/HBASE-5073 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.90.5 Attachments: HBASE-5073.patch HBaseAdmin apis like tableExists(), flush, split, closeRegion uses catalog tracker. Every time Root node tracker and meta node tracker are started and a listener is registered. But after the operations are performed the listeners are not getting removed. Hence if the admin apis are consistently used then it may lead to memory leak. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5073) Registered listeners not getting removed leading to memory leak in HBaseAdmin
[ https://issues.apache.org/jira/browse/HBASE-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173350#comment-13173350 ] stack commented on HBASE-5073: -- +1 Registered listeners not getting removed leading to memory leak in HBaseAdmin - Key: HBASE-5073 URL: https://issues.apache.org/jira/browse/HBASE-5073 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.90.5 Attachments: HBASE-5073.patch HBaseAdmin apis like tableExists(), flush, split, closeRegion uses catalog tracker. Every time Root node tracker and meta node tracker are started and a listener is registered. But after the operations are performed the listeners are not getting removed. Hence if the admin apis are consistently used then it may lead to memory leak. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173357#comment-13173357 ] Phabricator commented on HBASE-5072:
---
Kannan has commented on the revision "[jira] [HBASE-5072] Support Max Value for Per-Store Metrics".

INLINE COMMENTS
src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaMetrics.java:392
minor: Since this is already a map with MutableDoubles, you can break this into two cases to avoid new allocations when possible. Something like:

{code}
if (cur == null) {
  tmpMap.put(maxKey, new MutableDouble(val));
} else if (cur.doubleValue() < val) {
  cur.setValue(val);
}
{code}

REVISION DETAIL
https://reviews.facebook.net/D945

Support Max Value for Per-Store Metrics
---
Key: HBASE-5072
URL: https://issues.apache.org/jira/browse/HBASE-5072
Project: HBase
Issue Type: Improvement
Components: metrics, regionserver
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
Fix For: 0.94.0
Attachments: D945.1.patch

We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
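Kannan's pattern as a self-contained example, simplified from SchemaMetrics for illustration; as in the snippet above, the check-then-act is not atomic under concurrent writers:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.commons.lang.mutable.MutableDouble;

/** Running per-key maximum: allocate a MutableDouble only the first
 *  time a key is seen, then mutate it in place. */
class MaxTracker {
  private final Map<String, MutableDouble> tmpMap =
      new ConcurrentHashMap<String, MutableDouble>();

  void updateMax(String maxKey, double val) {
    MutableDouble cur = tmpMap.get(maxKey);
    if (cur == null) {
      tmpMap.put(maxKey, new MutableDouble(val)); // first observation
    } else if (cur.doubleValue() < val) {
      cur.setValue(val); // new maximum, no allocation
    }
  }
}
{code}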
[jira] [Updated] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5072: --- Attachment: D945.2.patch nspiegelberg updated the revision [jira] [HBASE-5072] Support Max Value for Per-Store Metrics. Reviewers: JIRA, mbautin, Kannan Added Kannan's peer review optimization REVISION DETAIL https://reviews.facebook.net/D945 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaMetrics.java src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5058) Allow HBaseAdmin to use an existing connection
[ https://issues.apache.org/jira/browse/HBASE-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173364#comment-13173364 ] Lars Hofhansl commented on HBASE-5058:
---
@Stack: I think that about sums it up. The complexity of layers and timeout stories are alleviated somewhat by parent HBASE-4805 (no per-HTable threadpool, HTablePool no longer needed).
I had a brief look at the first issue; unless I am missing something, this would require a nontrivial amount of refactoring. The simplest would be to do all network IO from the Connection thread rather than the application thread (as described in HBASE-4956). Would need to allow for the client to synchronize and retrieve exceptions on/from a Future.
Short term, should we take HBASE-4805 all the way and add a getTable(...) method to HConnection? (Or even further and add put/get/scan/etc methods that take a table name to HConnection?)
Long term a design based on asynchbase with a thin synchronous layer on top is probably the best option.

Allow HBaseAdmin to use an existing connection
---
Key: HBASE-5058
URL: https://issues.apache.org/jira/browse/HBASE-5058
Project: HBase
Issue Type: Sub-task
Components: client
Affects Versions: 0.94.0
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Minor
Fix For: 0.94.0
Attachments: 5058-v2.txt, 5058-v3.txt, 5058-v3.txt, 5058.txt

What HBASE-4805 does for HTables, this should do for HBaseAdmin. Along with this the shared error handling and retrying between HBaseAdmin and HConnectionManager can also be improved. I'll attach a first pass patch soon.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
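How client code might look if HConnection grew the getTable(...) floated above. This is a sketch of the proposed shape only: getTable did not exist on HConnection at the time of this comment, and HConnectionManager.createConnection is assumed from the HBASE-4805 work:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

class SharedConnectionSketch {
  static void example() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HConnection conn = HConnectionManager.createConnection(conf);
    try {
      HBaseAdmin admin = new HBaseAdmin(conn);      // what this patch adds
      HTableInterface t = conn.getTable("mytable"); // proposed, hypothetical
      // admin ops and table ops now share sockets, retries, error handling
    } finally {
      conn.close(); // single cleanup point
    }
  }
}
{code}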
[jira] [Issue Comment Edited] (HBASE-5058) Allow HBaseAmin to use an existing connection
[ https://issues.apache.org/jira/browse/HBASE-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173364#comment-13173364 ] Lars Hofhansl edited comment on HBASE-5058 at 12/20/11 6:12 PM: @Stack: I think that about sums it up. The complexity of layers and timeout stories are alleviated somewhat by parent HBASE-4805 (no per HTable threadpool, HTablePool no longer needed). I had a brief look at the first issue, unless I am missing something this would require a nontrivial amount of refactoring. The simplest would be to do all network IO from the Connection thread rather than the application thread (as described in HBASE-4956). Would need allow for the client to synchronize and retrieve exceptions on/from a Future. Short term, should we take HBASE-4805 all the way and add a getTable(...) method to HConnection? (Or even further and add put/get/scan/etc methods that take a table name to HConnection?) Long term a design based on asynchhbase with a thin synchronous layer on top is probably the best option. was (Author: lhofhansl): @Stack: I think that about sums it up. The complexity of layers and timeout stories are alleviated somewhat by parent HBASE-4805 (no per HTable threadpool, HTablePool no longer needed). I had a brief look at the first issue, unless I am missing something this would require a nontrivial amount of refactoring. The simplest would be to do all network IO from the Connection thread rather than the application thread (as described in HBASE-4956). Would need allow for the client to synchronize and retrieved exceptions on/from a Future. Short term, should we take HBASE-4805 all the way and a getTable(...) method to HConnection? (Or even further and add put/get/scan/etc methods that take a table name to HConnection?) Long term a design based on asynchhbase with a thin synchronous layer on top is probably the best option. Allow HBaseAmin to use an existing connection - Key: HBASE-5058 URL: https://issues.apache.org/jira/browse/HBASE-5058 Project: HBase Issue Type: Sub-task Components: client Affects Versions: 0.94.0 Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Minor Fix For: 0.94.0 Attachments: 5058-v2.txt, 5058-v3.txt, 5058-v3.txt, 5058.txt What HBASE-4805 does for HTables, this should do for HBaseAdmin. Along with this the shared error handling and retrying between HBaseAdmin and HConnectionManager can also be improved. I'll attach a first pass patch soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
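As a sketch of the short-term option (the HConnection.getTable(...) idea), a thin helper might look like the following, assuming the HTable constructor that parent HBASE-4805 introduces; the helper class itself is hypothetical:
{code}
import java.io.IOException;
import java.util.concurrent.ExecutorService;

import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: with a shared HConnection and ExecutorService, handing
// out HTable instances becomes cheap, which is what a getTable(...) method on
// HConnection would formalize.
public final class ConnectionTables {
  private ConnectionTables() {
  }

  public static HTableInterface getTable(HConnection conn, String tableName,
      ExecutorService pool) throws IOException {
    // Relies on the HTable(byte[], HConnection, ExecutorService) constructor
    // added by HBASE-4805.
    return new HTable(Bytes.toBytes(tableName), conn, pool);
  }
}
{code}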
[jira] [Commented] (HBASE-4895) Change tablename format in meta to be the UUID of the tablename rather than the tablename.
[ https://issues.apache.org/jira/browse/HBASE-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173370#comment-13173370 ] jirapos...@reviews.apache.org commented on HBASE-4895: -- bq. On 2011-12-20 00:48:41, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/HRegionInfo.java, line 29 bq. https://reviews.apache.org/r/3188/diff/3/?file=64523#file64523line29 bq. bq. If this is an md5 under the wraps, maybe we should just do md5 rather than do this uuid indirection? But maybe the UUID class has some facility you like that makes it easier to work with? I'm down with moving to an md5 bq. On 2011-12-20 00:48:41, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/HRegionInfo.java, line 340 bq. https://reviews.apache.org/r/3188/diff/3/?file=64523#file64523line340 bq. bq. Why line here? woops bq. On 2011-12-20 00:48:41, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/HRegionInfo.java, line 352 bq. https://reviews.apache.org/r/3188/diff/3/?file=64523#file64523line352 bq. bq. Were we talking about uuids in original code? bq. bq. Should we cache tablename in HRI if we are passed it so can avoid a meta hit if absent? bq. bq. If a meta hit to get table name, its in the last HRI only? Is that the plan? The last HRI in a table has the table name? Or if not this, where is it in the meta table? That's true, i must have gotten that comment in my previous patch (https://reviews.apache.org/r/3186/) I assumed the tablename was in the hregioninfo. Not sure what the third question means. bq. On 2011-12-20 00:48:41, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/HRegionInfo.java, line 398 bq. https://reviews.apache.org/r/3188/diff/3/?file=64523#file64523line398 bq. bq. Whats UUID tablename? And though its not you, whats the 1|2 about? The 1 or 2 is how you know it's the last region I can make it more clear i'm sure. bq. On 2011-12-20 00:48:41, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/HRegionInfo.java, line 422 bq. https://reviews.apache.org/r/3188/diff/3/?file=64523#file64523line422 bq. bq. Something is wrong w/ this patch ? We had a '@return The UUID of the Table name' in original src? Woops bq. On 2011-12-20 00:48:41, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/client/MetaSearchRow.java, line 21 bq. https://reviews.apache.org/r/3188/diff/3/?file=64525#file64525line21 bq. bq. MetaSearchRow is not in src, its brought in by another related patch? So this is a patch on top of that patch? https://reviews.apache.org/r/3186/ - Alex --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3188/#review3992 --- On 2011-12-13 23:36:44, Alex Newman wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3188/ bq. --- bq. bq. (Updated 2011-12-13 23:36:44) bq. bq. bq. Review request for hbase. bq. bq. bq. Summary bq. --- bq. bq. PART 2 of hbase-4616 bq. bq. By uuiding the tablename in the metarow, it enables us to be able to use binary values for the end of table marker bq. bq. bq. This addresses bug HBASE-4895. bq. https://issues.apache.org/jira/browse/HBASE-4895 bq. bq. bq. Diffs bq. - bq. 
bq. src/main/java/org/apache/hadoop/hbase/HRegionInfo.java 74cb821 bq. src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java e5e60a8 bq. src/main/java/org/apache/hadoop/hbase/client/MetaSearchRow.java PRE-CREATION bq. src/main/java/org/apache/hadoop/hbase/util/Merge.java 67d0fda bq. src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java 95712dd bq. src/test/java/org/apache/hadoop/hbase/coprocessor/SampleRegionWALObserver.java ff9c502 bq. src/test/java/org/apache/hadoop/hbase/coprocessor/TestClassLoading.java 368a0e5 bq. src/test/java/org/apache/hadoop/hbase/coprocessor/TestWALObserver.java 36dd289 bq. src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionInfo.java 6e1211b bq. src/test/java/org/apache/hadoop/hbase/rest/TestStatusResource.java cffdcb6 bq. src/test/ruby/hbase/admin_test.rb 0c2672b bq. bq. Diff: https://reviews.apache.org/r/3188/diff bq. bq. bq. Testing bq. --- bq. bq. bq. Thanks, bq. bq. Alex bq. bq. Change tablename format in meta to be the UUID of the tablename rather than the tablename. -- Key: HBASE-4895
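For illustration, the md5 alternative floated in the review above could reuse the existing HBase utility that derives encoded region names; a minimal sketch (the helper class is hypothetical):
{code}
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

// Hypothetical helper: a fixed-width hex md5 of the table name gives a
// fixed-length, binary-safe component for meta row keys, without the UUID
// class indirection questioned in the review.
public class TableNameHash {
  public static String hashTableName(String tableName) {
    return MD5Hash.getMD5AsHex(Bytes.toBytes(tableName));
  }
}
{code}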
[jira] [Commented] (HBASE-5074) support checksums in HBase block cache
[ https://issues.apache.org/jira/browse/HBASE-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173373#comment-13173373 ] Andrew Purtell commented on HBASE-5074: --- +1 support checksums in HBase block cache -- Key: HBASE-5074 URL: https://issues.apache.org/jira/browse/HBASE-5074 Project: HBase Issue Type: Improvement Components: regionserver Reporter: dhruba borthakur Assignee: dhruba borthakur The current implementation of HDFS stores the data in one block file and the metadata(checksum) in another block file. This means that every read into the HBase block cache actually consumes two disk iops, one to the datafile and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage-hardware offers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173377#comment-13173377 ] Phabricator commented on HBASE-5072: Kannan has commented on the revision [jira] [HBASE-5072] Support Max Value for Per-Store Metrics. Ok -- go it. You are tracking the max only across all CFs. Sounds good. Thanks for the clarification. I misread the code there. REVISION DETAIL https://reviews.facebook.net/D945 Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173382#comment-13173382 ] Phabricator commented on HBASE-5072: Kannan has commented on the revision [jira] [HBASE-5072] Support Max Value for Per-Store Metrics. s/go it/got it. REVISION DETAIL https://reviews.facebook.net/D945 Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5074) support checksums in HBase block cache
[ https://issues.apache.org/jira/browse/HBASE-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173385#comment-13173385 ] Jean-Daniel Cryans commented on HBASE-5074: --- This jira's title makes it sound like you want to checksum when reading from the block cache. support checksums in HBase block cache -- Key: HBASE-5074 URL: https://issues.apache.org/jira/browse/HBASE-5074 Project: HBase Issue Type: Improvement Components: regionserver Reporter: dhruba borthakur Assignee: dhruba borthakur The current implementation of HDFS stores the data in one block file and the metadata(checksum) in another block file. This means that every read into the HBase block cache actually consumes two disk iops, one to the datafile and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage-hardware offers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5074) support checksums in HBase block cache
[ https://issues.apache.org/jira/browse/HBASE-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173388#comment-13173388 ] stack commented on HBASE-5074: -- Where in the read pipeline would we verify the checksum? Down in hfile? Where would we do the exception processing forcing reread with checksum=on? Also down in hfile? (Nice idea BTW) support checksums in HBase block cache -- Key: HBASE-5074 URL: https://issues.apache.org/jira/browse/HBASE-5074 Project: HBase Issue Type: Improvement Components: regionserver Reporter: dhruba borthakur Assignee: dhruba borthakur The current implementation of HDFS stores the data in one block file and the metadata(checksum) in another block file. This means that every read into the HBase block cache actually consumes two disk iops, one to the datafile and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage-hardware offers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5033: --- Attachment: D933.3.patch Liyin updated the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. Reviewers: Kannan, mbautin, Karthik, JIRA Refactor the code REVISION DETAIL https://reviews.facebook.net/D933 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/Store.java src/main/java/org/apache/hadoop/hbase/util/Threads.java Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HBASE-5072: --- Attachment: HBASE-5072.patch note: patch applies cleanly to both 89-fb and trunk Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch, HBASE-5072.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HBASE-5072: --- Status: Patch Available (was: Open) Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch, HBASE-5072.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HBASE-5072: --- Resolution: Fixed Status: Resolved (was: Patch Available) Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch, HBASE-5072.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173407#comment-13173407 ] Phabricator commented on HBASE-5072: nspiegelberg has committed the revision [jira] [HBASE-5072] Support Max Value for Per-Store Metrics. REVISION DETAIL https://reviews.facebook.net/D945 COMMIT https://reviews.facebook.net/rHBASE1221419 Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch, HBASE-5072.patch We were bit in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4698) Let the HFile Pretty Printer print all the key values for a specific row.
[ https://issues.apache.org/jira/browse/HBASE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HBASE-4698: -- Status: Patch Available (was: Open) Let the HFile Pretty Printer print all the key values for a specific row. - Key: HBASE-4698 URL: https://issues.apache.org/jira/browse/HBASE-4698 Project: HBase Issue Type: New Feature Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D111.1.patch, D111.1.patch, D111.1.patch, D111.2.patch, D111.3.patch, D111.4.patch, HBASE-4689-trunk.patch When using HFile Pretty Printer to debug HBase issues, it would be very nice to allow the Pretty Printer to seek to a specific row, and only print all the key values for this row. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4698) Let the HFile Pretty Printer print all the key values for a specific row.
[ https://issues.apache.org/jira/browse/HBASE-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173425#comment-13173425 ] Hadoop QA commented on HBASE-4698: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508009/HBASE-4689-trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/555//console This message is automatically generated. Let the HFile Pretty Printer print all the key values for a specific row. - Key: HBASE-4698 URL: https://issues.apache.org/jira/browse/HBASE-4698 Project: HBase Issue Type: New Feature Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D111.1.patch, D111.1.patch, D111.1.patch, D111.2.patch, D111.3.patch, D111.4.patch, HBASE-4689-trunk.patch When using HFile Pretty Printer to debug HBase issues, it would be very nice to allow the Pretty Printer to seek to a specific row, and only print all the key values for this row. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173428#comment-13173428 ] Phabricator commented on HBASE-5033: lhofhansl has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:536 Should these be daemon threads? src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:814 Same here: daemon threads? Does this have to be a separate pool from the opener pool? (I guess yes, but just want to make sure) src/main/java/org/apache/hadoop/hbase/regionserver/Store.java:294 This is a bit confusing to me. We already have a thread pool to open the stores, now we have another pool to open storefiles in each store. So in the worst case with the default pool size of 10 we could open 10*10 store files in parallel? Should there be different config options for the number of stores (i.e. CFs) in parallel and the number of store files per store to be opened in parallel? REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5070) Constraints implementation and javadoc changes
[ https://issues.apache.org/jira/browse/HBASE-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173439#comment-13173439 ] jirapos...@reviews.apache.org commented on HBASE-5070: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3273/ --- Review request for hbase, Gary Helmling, Ted Yu, and Michael Stack. Summary --- Follow-up on changes to constraint as per stack's comments on HBASE-4605. This addresses bug HBASE-5070. https://issues.apache.org/jira/browse/HBASE-5070 Diffs - src/docbkx/book.xml bd3f881 src/main/java/org/apache/hadoop/hbase/constraint/BaseConstraint.java 7ce6d45 src/main/java/org/apache/hadoop/hbase/constraint/Constraint.java 2d8b4d7 src/main/java/org/apache/hadoop/hbase/constraint/Constraints.java 7825466 src/main/java/org/apache/hadoop/hbase/constraint/package-info.java 6145ed5 src/test/java/org/apache/hadoop/hbase/constraint/CheckConfigurationConstraint.java c49098d Diff: https://reviews.apache.org/r/3273/diff Testing --- mvn clean test -P localTests -Dtest=*Constraint* - all tests pass. Thanks, Jesse Constraints implementation and javadoc changes -- Key: HBASE-5070 URL: https://issues.apache.org/jira/browse/HBASE-5070 Project: HBase Issue Type: Task Reporter: Zhihong Yu This is continuation of HBASE-4605 See Stack's comments https://reviews.apache.org/r/2579/#review3980 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173457#comment-13173457 ] Phabricator commented on HBASE-5033: Liyin has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/Store.java:294 As you suggested, it already uses 2 separate parameters to control the number of stores and store files that will be opened or closed in parallel. For example: hbase.hregion.storeCloser.threads.max controls the number of stores closed in parallel, while hbase.hregion.storeFileCloser.threads.max controls the number of store files closed in parallel. src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:814 It is easier to control the life cycle of the thread pools and to decouple the open and close operations if we use separate thread pools. There is some detailed explanation in the previous comments :) src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:536 I am not quite sure whether we need to set these threads as daemons since the thread pool will be shut down in the finally block anyway. The main thread shall never leave any tasks running in these thread pools after the finally block. Is there any specific reason? Or will it always be safe to set these threads as daemons? REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173460#comment-13173460 ] Phabricator commented on HBASE-5033: lhofhansl has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. Looks good then. Thanks for the explanation. Also please see my comment on the jira (this is only helping with request latency and might possibly be detrimental to aggregate throughput). INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/regionserver/Store.java:294 Oops... Didn't see ...storeOpener vs ...storeFileOpener. I think traditionally we'd name them store.opener... and storefile.opener... src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:814 Since the threadpool is shutdown in a finally clause it is probably ok. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
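As a concrete illustration of the pattern under discussion, here is a minimal sketch (names and structure are simplified stand-ins, not the actual HRegion/Store code) of a bounded pool of daemon threads that runs open tasks in parallel and is always shut down in a finally block:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Simplified stand-in for the store/store-file opener pools in the patch.
public class ParallelOpenerSketch {
  public static <T> List<T> openAll(List<Callable<T>> openers, int maxThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(maxThreads,
        new ThreadFactory() {
          public Thread newThread(Runnable r) {
            Thread t = new Thread(r);
            t.setDaemon(true);  // daemon or not, the pool is torn down below
            return t;
          }
        });
    try {
      CompletionService<T> done = new ExecutorCompletionService<T>(pool);
      for (Callable<T> opener : openers) {
        done.submit(opener);
      }
      List<T> results = new ArrayList<T>(openers.size());
      for (int i = 0; i < openers.size(); i++) {
        results.add(done.take().get());  // propagates any open failure
      }
      return results;
    } finally {
      pool.shutdownNow();  // mirrors the finally-block shutdown discussed above
    }
  }
}
{code}
With two such pools (one for stores, one for store files within a store), the worst-case parallelism is the product of the two pool sizes, which is exactly the sizing concern raised in the review.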
[jira] [Commented] (HBASE-5074) support checksums in HBase block cache
[ https://issues.apache.org/jira/browse/HBASE-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173461#comment-13173461 ] dhruba borthakur commented on HBASE-5074: - Yes, the verification of the checksums would happen when the hfile block is loaded into the block cache. It will be entirely in hfile code. Also, the exception processing would happen in hfile too. support checksums in HBase block cache -- Key: HBASE-5074 URL: https://issues.apache.org/jira/browse/HBASE-5074 Project: HBase Issue Type: Improvement Components: regionserver Reporter: dhruba borthakur Assignee: dhruba borthakur The current implementation of HDFS stores the data in one block file and the metadata(checksum) in another block file. This means that every read into the HBase block cache actually consumes two disk iops, one to the datafile and one to the checksum file. This is a major problem for scaling HBase, because HBase is usually bottlenecked on the number of random disk iops that the storage-hardware offers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
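A rough sketch of what that in-hfile verification could look like, assuming a per-chunk CRC32 layout (the block format, chunk size, and method names here are assumptions for illustration, not the actual patch):
{code}
import java.io.IOException;
import java.util.zip.CRC32;

// Illustrative only: verify one stored CRC32 per fixed-size chunk of a block
// as it is loaded into the block cache. On a mismatch, hfile-level code would
// catch the exception and force a re-read with HDFS-level checksum
// verification turned back on.
public class BlockChecksumSketch {
  public static void verify(byte[] block, int chunkSize, int[] storedCrcs)
      throws IOException {
    CRC32 crc = new CRC32();
    for (int chunk = 0, off = 0; off < block.length; chunk++, off += chunkSize) {
      int len = Math.min(chunkSize, block.length - off);
      crc.reset();
      crc.update(block, off, len);
      if ((int) crc.getValue() != storedCrcs[chunk]) {
        throw new IOException("checksum mismatch in chunk " + chunk);
      }
    }
  }
}
{code}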
[jira] [Commented] (HBASE-5073) Registered listeners not getting removed leading to memory leak in HBaseAdmin
[ https://issues.apache.org/jira/browse/HBASE-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173462#comment-13173462 ] Lars Hofhansl commented on HBASE-5073: -- +1 Maybe in another jira we should either disallow passing a Watcher (since unremovable listeners will be added to it), or clean up the listeners. That applies to 0.92 and trunk as well. Registered listeners not getting removed leading to memory leak in HBaseAdmin - Key: HBASE-5073 URL: https://issues.apache.org/jira/browse/HBASE-5073 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.90.5 Attachments: HBASE-5073.patch HBaseAdmin apis like tableExists(), flush, split, closeRegion uses catalog tracker. Every time Root node tracker and meta node tracker are started and a listener is registered. But after the operations are performed the listeners are not getting removed. Hence if the admin apis are consistently used then it may lead to memory leak. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
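The leak pattern itself is easy to see in isolation; a self-contained sketch (generic names, not the actual CatalogTracker/ZooKeeperWatcher API) of why every registration needs a matching removal:
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Generic illustration of the bug: if each admin call registers a listener
// and nothing removes it, the watcher's listener list grows without bound.
public class ListenerRegistrySketch<L> {
  private final List<L> listeners = new CopyOnWriteArrayList<L>();

  public void register(L listener) { listeners.add(listener); }
  public void remove(L listener) { listeners.remove(listener); }

  // The fix amounts to pairing every register with a remove:
  public void withListener(L listener, Runnable op) {
    register(listener);
    try {
      op.run();
    } finally {
      remove(listener);  // without this line, one listener leaks per call
    }
  }
}
{code}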
[jira] [Commented] (HBASE-5070) Constraints implementation and javadoc changes
[ https://issues.apache.org/jira/browse/HBASE-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173466#comment-13173466 ] jirapos...@reviews.apache.org commented on HBASE-5070: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3273/#review4022 --- src/docbkx/book.xml https://reviews.apache.org/r/3273/#comment9121 Should read 'checking is enabled' src/docbkx/book.xml https://reviews.apache.org/r/3273/#comment9122 When would the URL be active ? It is not available now. src/main/java/org/apache/hadoop/hbase/constraint/Constraint.java https://reviews.apache.org/r/3273/#comment9123 Whitespace should be removed. src/main/java/org/apache/hadoop/hbase/constraint/Constraints.java https://reviews.apache.org/r/3273/#comment9124 I think this should start with 'Constraint Class ' - Ted On 2011-12-20 19:14:46, Jesse Yates wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3273/ bq. --- bq. bq. (Updated 2011-12-20 19:14:46) bq. bq. bq. Review request for hbase, Gary Helmling, Ted Yu, and Michael Stack. bq. bq. bq. Summary bq. --- bq. bq. Follow-up on changes to constraint as per stack's comments on HBASE-4605. bq. bq. bq. This addresses bug HBASE-5070. bq. https://issues.apache.org/jira/browse/HBASE-5070 bq. bq. bq. Diffs bq. - bq. bq.src/docbkx/book.xml bd3f881 bq.src/main/java/org/apache/hadoop/hbase/constraint/BaseConstraint.java 7ce6d45 bq.src/main/java/org/apache/hadoop/hbase/constraint/Constraint.java 2d8b4d7 bq.src/main/java/org/apache/hadoop/hbase/constraint/Constraints.java 7825466 bq.src/main/java/org/apache/hadoop/hbase/constraint/package-info.java 6145ed5 bq. src/test/java/org/apache/hadoop/hbase/constraint/CheckConfigurationConstraint.java c49098d bq. bq. Diff: https://reviews.apache.org/r/3273/diff bq. bq. bq. Testing bq. --- bq. bq. mvn clean test -P localTests -Dest=*Constraint* - all tests pass. bq. bq. bq. Thanks, bq. bq. Jesse bq. bq. Constraints implementation and javadoc changes -- Key: HBASE-5070 URL: https://issues.apache.org/jira/browse/HBASE-5070 Project: HBase Issue Type: Task Reporter: Zhihong Yu This is continuation of HBASE-4605 See Stack's comments https://reviews.apache.org/r/2579/#review3980 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173498#comment-13173498 ] Hudson commented on HBASE-5063: --- Integrated in HBase-0.92-security #46 (See [https://builds.apache.org/job/HBase-0.92-security/46/]) HBASE-5063 RegionServers fail to report to backup HMaster after primary goes down stack : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java RegionServers fail to report to backup HMaster after primary goes down. --- Key: HBASE-5063 URL: https://issues.apache.org/jira/browse/HBASE-5063 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Priority: Critical Fix For: 0.92.0 Attachments: HBASE-5063.patch, hbase-5063.v2.0.92.patch, hbase-5063.v2.trunk.patch # Setup cluster with two HMasters # Observe that HM1 is up and that all RS's are in the RegionServer list on web page. # Kill (not even -9) the active HMaster # Wait for ZK to time out (default 3 minutes). # Observe that HM2 is now active. Tables may show up but RegionServers never report on web page. Existing connections are fine. New connections cannot find regionservers. Note: * If we replace a new HM1 in the same place and kill HM2, the cluster functions normally again after recovery. This sees to indicate that regionservers are stuck trying to talk to the old HM1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time
[ https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173502#comment-13173502 ] Phabricator commented on HBASE-5033: Liyin has commented on the revision [jira][HBASE-5033][[89-fb]]Opening/Closing store in parallel to reduce region open/close time. Thanks Lars for the review :) I just read your comments in the jira and am sorry I missed them at first. I totally agree with you that we should be very careful not to over-parallelize and overwhelm the region server and name node. So these configuration parameters really matter. Also, considering we are still processing each region open and region close message one at a time, we may not get much of a throughput win by parallelizing the store/store file open and close process. REVISION DETAIL https://reviews.facebook.net/D933 Opening/Closing store in parallel to reduce region open/close time -- Key: HBASE-5033 URL: https://issues.apache.org/jira/browse/HBASE-5033 Project: HBase Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang Attachments: D933.1.patch, D933.2.patch, D933.3.patch Region servers are opening/closing each store and each store file for every store in sequential fashion, which may cause inefficiency to open/close regions. So this diff is to open/close each store in parallel in order to reduce region open/close time. Also it would help to reduce the cluster restart time. 1) Opening each store in parallel 2) Loading each store file for every store in parallel 3) Closing each store in parallel 4) Closing each store file for every store in parallel. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4218) Delta Encoding of KeyValues (aka prefix compression)
[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173514#comment-13173514 ] Phabricator commented on HBASE-4218: mbautin has commented on the revision [jira] [HBASE-4218] Delta encoding for keys in HFile. Replying to the rest of the comments. A new version of the patch will follow. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:65 Added missing javadoc for includingMemstoreTS. src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoder.java:126 seekBefore only matters in case of an exact match. I will update the javadoc. src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:34 Updated. src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:147 Added an assertion. src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestBufferedDeltaEncoder.java:34 Fixed. src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestDeltaEncoders.java:47 Fixed (LargeTests -- runs in 2 minutes). src/test/java/org/apache/hadoop/hbase/io/deltaencoder/TestBufferedDeltaEncoder.java:34 Fixed (SmallTests). src/test/java/org/apache/hadoop/hbase/util/TestByteBufferUtils.java:35 Fixed (SmallTests) REVISION DETAIL https://reviews.facebook.net/D447 Delta Encoding of KeyValues (aka prefix compression) - Key: HBASE-4218 URL: https://issues.apache.org/jira/browse/HBASE-4218 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jacek Migdal Assignee: Mikhail Bautin Labels: compression Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, D447.1.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, Delta_encoding_with_memstore_TS.patch, open-source.diff A compression for keys. Keys are sorted in HFile and they are usually very similar. Because of that, it is possible to design better compression than general purpose algorithms. It is an additional step designed to be used in memory. It aims to save memory in cache as well as speed up seeks within HFileBlocks. It should improve performance a lot if key lengths are larger than value lengths. For example, it makes a lot of sense to use it when the value is a counter. Initial tests on real data (key length = ~90 bytes, value length = 8 bytes) show that I could achieve a decent level of compression: key compression ratio: 92% total compression ratio: 85% LZO on the same data: 85% LZO after delta encoding: 91% while having much better performance (20-80% faster decompression than LZO). Moreover, it should allow far more efficient seeking which should improve performance a bit. It seems that simple compression algorithms are good enough. Most of the savings are due to prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields). In order to implement it in HBase two important changes in design will be needed: -solidify interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to uncompressed buffer in HFileBlock will have bad performance -extend comparators to support comparison assuming that N first bytes are equal (or some fields are equal) Link to a discussion about something similar: http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windowssubj=Re+prefix+compression -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
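To make the prefix-compression idea concrete, here is a deliberately simplified sketch of encoding sorted keys as (shared-prefix length, suffix) pairs; the real PrefixKeyDeltaEncoder in the patch also handles values, timestamps, and memstoreTS:
{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified illustration: each key stores only how many leading bytes it
// shares with the previous key, plus its remaining suffix bytes.
public class PrefixEncoderSketch {
  public static byte[] encode(byte[][] sortedKeys) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    byte[] prev = new byte[0];
    for (byte[] key : sortedKeys) {
      int common = 0;
      int max = Math.min(prev.length, key.length);
      while (common < max && prev[common] == key[common]) {
        common++;
      }
      out.writeShort(common);                       // shared prefix length
      out.writeShort(key.length - common);          // suffix length
      out.write(key, common, key.length - common);  // suffix bytes only
      prev = key;
    }
    out.flush();
    return bytes.toByteArray();
  }
}
{code}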
[jira] [Created] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Testing 0.92.0RC, ran into interesting issue where a log file had edits for many regions and just opening the file per region was taking so long, we were never updating our progress and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take to acquire the task. First, here is master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break spacetime continuum, I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
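For illustration, the kind of defensive handling the stack trace suggests is missing, sketched against the plain ZooKeeper client API (this is an assumption about a possible approach, not the eventual HBase fix):
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: when ending a split task, a missing or reassigned task
// znode can be treated as "another worker owns or finished it" instead of a
// fatal logic error that kills the regionserver.
public class EndTaskSketch {
  public static boolean tryEndTask(ZooKeeper zk, String path, byte[] doneState,
      int expectedVersion) throws KeeperException, InterruptedException {
    try {
      zk.setData(path, doneState, expectedVersion);
      return true;
    } catch (KeeperException.NoNodeException e) {
      return false;  // task znode vanished: quietly let go of the task
    } catch (KeeperException.BadVersionException e) {
      return false;  // task was preempted by another worker
    }
  }
}
{code}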
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Attachment: 5078.txt Check if we should heartbeat/report progress every time we open a file. Removed unused local variable totalBytesToSplit. Added new openedNewFile boolean that is set every time we create a new file and then cleared each time we go to check if we should report progress. Removed hard tabs. Added some to the summary log message. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Attachments: 5078.txt Testing 0.92.0RC, ran into interesting issue where a log file had edits for many regions and just opening the file per region was taking so long, we were never updating our progress and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take to acquire the task. First, here is master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When above happened, we'd only processed 40 edits. 
As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
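A condensed sketch of the reporting condition the patch describes (field and method names are illustrative; see 5078.txt for the real change):
{code}
// Illustrative only: report progress not just every N edits but also any time
// a new recovered-edits file had to be opened, since file opens are what
// dominate the elapsed time here.
public class SplitProgressSketch {
  private static final int EDITS_PER_HEARTBEAT = 1024;
  private boolean openedNewFile = false;
  private long editsProcessed = 0;

  void onNewOutputFile() {
    openedNewFile = true;  // set each time a new file is created
  }

  boolean shouldReportProgress() {
    editsProcessed++;
    boolean report = openedNewFile
        || editsProcessed % EDITS_PER_HEARTBEAT == 0;
    openedNewFile = false;  // cleared each time we check, per the patch notes
    return report;
  }
}
{code}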
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173548#comment-13173548 ] Jean-Daniel Cryans commented on HBASE-5077: --- This is from the master's POV: {quote} 2011-12-20 02:59:42,086 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 2011-12-20 02:59:42,089 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 0 2011-12-20 02:59:42,113 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r13s38,62023,1324345934996 2011-12-20 03:00:09,244 INFO org.apache.hadoop.hbase.master.SplitLogManager: resubmitting task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 2011-12-20 03:00:09,302 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 3 2011-12-20 03:02:53,072 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r28s44,62023,1324345934970 2011-12-20 03:03:21,117 INFO org.apache.hadoop.hbase.master.SplitLogManager: resubmitting task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 2011-12-20 03:03:21,136 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 6 2011-12-20 03:04:40,421 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r6s38,62023,1324345935082 2011-12-20 03:05:09,133 INFO org.apache.hadoop.hbase.master.SplitLogManager: resubmitting task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 2011-12-20 03:05:09,144 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 ver = 9 2011-12-20 03:05:09,193 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 acquired by sv4r30s44,62023,1324345935039 2011-12-20 03:05:36,137 INFO 
org.apache.hadoop.hbase.master.SplitLogManager: Skipping resubmissions of task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 because threshold 3 reached ... 2011-12-20 03:05:47,139 INFO org.apache.hadoop.hbase.master.SplitLogManager: Skipping resubmissions of task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 because threshold 3 reached 2011-12-20 03:05:50,320 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 entered state done sv4r30s44,62023,1324345935039 {quote} The one that died is sv4r6s38, the 3rd one to acquire the task. Here's its log: {quote} 2011-12-20 03:04:40,418 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r6s38,62023,1324345935082 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 2011-12-20 03:04:43,574 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
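For context on the resubmit loop in the master log above: the master hands the task out, waits for progress, and re-offers it at a climbing znode version until a resubmit threshold (3 in these logs) is hit. A minimal sketch of that bookkeeping, with illustrative names rather than SplitLogManager's actual fields:

{code}
// Sketch only: the per-task resubmit accounting implied by the log above.
// Names are illustrative, not SplitLogManager's real fields. Each resubmit
// cycle rewrites the task znode, which is why the logged 'ver' keeps climbing.
final class TaskSketch {
  int unforcedResubmits = 0;
}

final class ResubmitSketch {
  static final int RESUBMIT_THRESHOLD = 3; // "because threshold 3 reached"

  static boolean maybeResubmit(TaskSketch task) {
    if (task.unforcedResubmits >= RESUBMIT_THRESHOLD) {
      return false;            // stop re-offering; leave it with the current owner
    }
    task.unforcedResubmits++;  // re-offer the task to another worker
    return true;
  }
}
{code}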
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Priority: Critical (was: Major) I think this is pretty critical; I couldn't successfully split a log for a long period of time as the log splitting was moved about among machines, failing on each. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
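The heart of the eventual fix is to report progress on events other than the edit count alone. A minimal sketch of that idea, not the patch itself (the callback interface and constant names here are illustrative): heartbeat whenever a new recovered-edits writer is opened, since the file opens are what ate the 25 seconds.

{code}
// Sketch: heartbeat on new writer opens as well as every N edits, so slow
// file creation still resets the master's task timeout. Names illustrative.
interface ProgressReporter { boolean progress(); }

final class HeartbeatPolicy {
  static final int EDITS_PER_HEARTBEAT = 1024;

  // Returns false if the progress report failed (i.e., the task was preempted).
  static boolean maybeHeartbeat(int editsSeen, boolean openedNewFile,
                                ProgressReporter reporter) {
    if (openedNewFile || (editsSeen > 0 && editsSeen % EDITS_PER_HEARTBEAT == 0)) {
      return reporter.progress();
    }
    return true;
  }
}
{code}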
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173560#comment-13173560 ] Zhihong Yu commented on HBASE-5078: --- Nice finding. {code} +// timeout of if that not set, the split log DEFAULT_TIMEOUT) {code} The above should read 'timeout or if ...' {code} +// ignore edits from this region. It doesn't ezist anymore. {code} exist was spelled incorrectly. {code} continue; } else { logWriters.put(region, wap); } + openedNewFile = true; {code} Assignment to openedNewFile depends on the continue statement. It would be better to move the assignment into the else block, or to remove the else block and put the logWriters.put() call together with the new assignment; see the sketch below. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. 
Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
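The restructuring suggested in the review above would look roughly like this (a sketch against the patch's names; the surrounding loop is elided and the types are stand-ins):

{code}
import java.util.Map;

// Sketch of the suggested restructuring: drop the else block so the
// openedNewFile flag is set on the same path as logWriters.put(), and the
// continue can never be confused with a stale assignment.
final class OpenedNewFileSketch {
  /** Returns the new value of openedNewFile for this region. */
  static boolean registerWriter(Map<String, Object> logWriters,
                                String region, Object wap) {
    if (wap == null) {
      return false;              // corresponds to the patch's 'continue' path
    }
    logWriters.put(region, wap); // register the newly opened writer
    return true;                 // openedNewFile, tied directly to the put()
  }
}
{code}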
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173567#comment-13173567 ] Mubarak Seyed commented on HBASE-4720: -- When I ran the tests, they failed at {code} Running org.apache.hadoop.hbase.util.TestRegionSplitCalculator Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.077 sec FAILURE! Failed tests: testSplitCalculatorEq(org.apache.hadoop.hbase.util.TestRegionSplitCalculator): expected:<2> but was:<1> {code} Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular hbase client but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173569#comment-13173569 ] Zhihong Yu commented on HBASE-5077: --- After the preemption log, the following code should have run: {code} void stopTask() { LOG.info("Sending interrupt to stop the worker thread"); worker.interrupt(); // TODO interrupt often gets swallowed, do what else? } {code} I think the following method should have been called instead: {code} public void stop() { exitWorker = true; stopTask(); } {code} SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break the spacetime continuum; I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173574#comment-13173574 ] Jean-Daniel Cryans commented on HBASE-5077: --- Won't exitWorker kill the SplitLogWorker fully? Not just the task, but the RS will actually stop serving log splitting. SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break the spacetime continuum; I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173575#comment-13173575 ] Jean-Daniel Cryans commented on HBASE-5077: --- One problem I see is that in HLogSplitter.splitLogFileToTemp we do this in the finally: {quote} if ((progress_failed == false) && (reporter != null) && (reporter.progress() == false)) { progress_failed = true; } {quote} But at this point progress_failed isn't taken into account, so the method returns true. Looking at other parts of that method, it seems it's missing a return false which would be correctly handled by SplitLogWorker. SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break the spacetime continuum; I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Issue Comment Edited] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173575#comment-13173575 ] Jean-Daniel Cryans edited comment on HBASE-5077 at 12/20/11 10:35 PM: -- One problem I see is that in HLogSplitter.splitLogFileToTemp we do this in the finally: {code} if ((progress_failed == false) && (reporter != null) && (reporter.progress() == false)) { progress_failed = true; } {code} But at this point progress_failed isn't taken into account, so the method returns true. Looking at other parts of that method, it seems it's missing a return false which would be correctly handled by SplitLogWorker. was (Author: jdcryans): One problem I see is that in HLogSplitter.splitLogFileToTemp we do this in the finally: {quote} if ((progress_failed == false) && (reporter != null) && (reporter.progress() == false)) { progress_failed = true; } {quote} But at this point progress_failed isn't taken into account, so the method returns true. Looking at other parts of that method, it seems it's missing a return false which would be correctly handled by SplitLogWorker. SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break the spacetime continuum; I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
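What the missing return amounts to, in sketch form (the method name and reporter interface are stand-ins; the real HLogSplitter method does much more than this):

{code}
// Sketch: let a failed reporter.progress() in the finally block flip the
// method's return value instead of being computed and then ignored.
interface ProgressReporter { boolean progress(); }

final class SplitReturnSketch {
  static boolean splitWithProgress(ProgressReporter reporter) {
    boolean progress_failed = false;
    try {
      // ... replay edits here, heartbeating periodically ...
    } finally {
      if (!progress_failed && reporter != null && !reporter.progress()) {
        progress_failed = true;
      }
    }
    // The bug described above is returning true unconditionally; returning
    // this instead lets SplitLogWorker see that the task was preempted.
    return !progress_failed;
  }
}
{code}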
[jira] [Commented] (HBASE-5072) Support Max Value for Per-Store Metrics
[ https://issues.apache.org/jira/browse/HBASE-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173578#comment-13173578 ] Hudson commented on HBASE-5072: --- Integrated in HBase-TRUNK #2563 (See [https://builds.apache.org/job/HBase-TRUNK/2563/]) [jira] [HBASE-5072] Support Max Value for Per-Store Metrics Summary: We were bitten in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. Test Plan: - mvn test -Dtest=TestRegionServerMetrics Reviewers: JIRA, mbautin, Kannan Reviewed By: Kannan CC: stack, nspiegelberg, mbautin, Kannan Differential Revision: 945 nspiegelberg : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaMetrics.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java Support Max Value for Per-Store Metrics --- Key: HBASE-5072 URL: https://issues.apache.org/jira/browse/HBASE-5072 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.94.0 Attachments: D945.1.patch, D945.2.patch, HBASE-5072.patch We were bitten in our multi-tenant cluster because one of our Stores encountered a bug and grew its StoreFile count. We didn't notice this because the StoreFile count currently reported by the RegionServer is an average of all Stores in the region. For the per-Store metrics, we should also record the max so we can notice outliers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
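The committed idea in miniature: track a running max next to the average so a single outlier Store stands out. A sketch only, not SchemaMetrics itself:

{code}
// Sketch: record both the average and the max storefile count across Stores,
// so one runaway Store is visible instead of being averaged away.
final class StoreFileCountMetric {
  private long sum, samples, max;

  synchronized void record(long storefiles) {
    sum += storefiles;
    samples++;
    if (storefiles > max) max = storefiles;
  }
  synchronized double average() { return samples == 0 ? 0.0 : (double) sum / samples; }
  synchronized long maximum()   { return max; }
}
{code}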
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173577#comment-13173577 ] Zhihong Yu commented on HBASE-5077: --- To answer J-D's question, let me reference the following code from taskLoop(): {code} } catch (InterruptedException e) { LOG.info("SplitLogWorker interrupted while waiting for task," + " exiting: " + e.toString()); assert exitWorker == true; return; } {code} where exitWorker was expected to be true. I think assertions weren't enabled at runtime, so the assert never fired. SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break the spacetime continuum; I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
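One likely reason the assert said nothing: Java assertions are disabled unless the JVM runs with -ea, which production regionservers normally don't. A two-line demonstration:

{code}
// Prints the message under plain 'java AssertDemo'; throws AssertionError
// only when run with 'java -ea AssertDemo'.
public class AssertDemo {
  public static void main(String[] args) {
    boolean exitWorker = false;
    assert exitWorker == true : "interrupted without exitWorker set";
    System.out.println("assertion skipped (assertions disabled)");
  }
}
{code}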
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173584#comment-13173584 ] Jean-Daniel Cryans commented on HBASE-5078: --- ZK operations are quite expensive; instead of doing it for every file, it'd be better to do it every 2 or 3 files. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
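J-D's batching suggestion, sketched (the counter, threshold, and reporter interface are illustrative): pay the ZooKeeper round trip once per few writer opens rather than on every open.

{code}
// Sketch: amortize the ZK setData cost by heartbeating every few file opens.
interface ProgressReporter { boolean progress(); }

final class BatchedHeartbeat {
  static final int OPENS_PER_HEARTBEAT = 3;  // "every 2 or 3 files"
  private int opensSinceHeartbeat = 0;

  boolean onWriterOpened(ProgressReporter reporter) {
    if (++opensSinceHeartbeat >= OPENS_PER_HEARTBEAT) {
      opensSinceHeartbeat = 0;
      return reporter.progress();   // one znode write instead of three
    }
    return true;
  }
}
{code}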
[jira] [Created] (HBASE-5079) DistributedLogSplitter interrupt can be hazardous to regionserver health
DistributedLogSplitter interrupt can be hazardous to regionserver health Key: HBASE-5079 URL: https://issues.apache.org/jira/browse/HBASE-5079 Project: HBase Issue Type: Bug Reporter: stack The DLS interrupt can kill the regionserver if it happens while a conversation w/ the namenode is going on. The interrupt is used to end a task on the regionserver when it is done, whether successful or not, or to interrupt an ongoing split that has since been assumed by another server. I saw this issue in testing because I was killing servers. I was also suffering HBASE-5078, 'DistributedLogSplitter failing to split file because it has edits for lots of regions', which made it more likely to happen. Here is what it looks like on the regionserver that died: {code} 2011-12-20 17:54:58,009 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403495463 preempted from sv4r13s38,7003,1324365396583, current task state and owner=owned sv4r27s44,7003,1324365396664 2011-12-20 17:54:58,009 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread 2011-12-20 17:54:59,133 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403495463 preempted from sv4r13s38,7003,1324365396583, current task state and owner=owned sv4r27s44,7003,1324365396664 2011-12-20 17:54:59,134 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread ... 2011-12-20 17:55:25,505 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403495463 preempted from sv4r13s38,7003,1324365396583, current task state and owner=unassigned sv4r11s38,7001,1324365395047 2011-12-20 17:55:25,505 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread {code} Three interrupts are sent over a period of 31 seconds or so. 
Eventually the interrupt has an effect and I get: {code} 2011-12-20 17:55:25,505 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread 2011-12-20 17:55:48,022 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested 2011-12-20 17:55:58,070 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Call to sv4r11s38/10.4.11.38:7000 failed on local exception: java.nio.channels.ClosedByInterruptException at org.apache.hadoop.ipc.Client.wrapException(Client.java:1103) at org.apache.hadoop.ipc.Client.call(Client.java:1071) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy9.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826) Caused by: java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341) at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:779) at
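The JVM behavior underneath these traces: a thread interrupted while blocked on an InterruptibleChannel has the channel closed out from under it and gets ClosedByInterruptException, which is exactly what DFSClient is reporting above. A standalone demonstration, outside HBase (example.org is just some reachable host):

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.SocketChannel;

// Demonstrates that interrupting a thread blocked in NIO closes the channel
// and surfaces ClosedByInterruptException.
public class InterruptDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread io = new Thread(() -> {
      try (SocketChannel ch = SocketChannel.open()) {
        ch.connect(new InetSocketAddress("example.org", 80)); // blocking connect
        ch.read(ByteBuffer.allocate(1));                      // blocks here
      } catch (ClosedByInterruptException e) {
        System.out.println("channel closed by interrupt: " + e);
      } catch (IOException e) {
        System.out.println("other I/O failure: " + e);
      }
    });
    io.start();
    Thread.sleep(500);  // give the read time to block
    io.interrupt();     // the blocked read fails with ClosedByInterruptException
    io.join();
  }
}
{code}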
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173592#comment-13173592 ] Jean-Daniel Cryans commented on HBASE-5077: --- I now understand how I got all the way to closing the files without aborting the splitting; the interrupt is being retried by the DFSClient: {quote} 2011-12-20 03:05:09,194 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 preempted from sv4r6s38,62023,1324345935082, current task state and owner=owned sv4r30s44,62023,1324345935039 2011-12-20 03:05:09,194 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread 2011-12-20 03:05:09,214 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.28.44:51010, add to deadNodes and continue java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2354) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2033) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.seekToBlockSource(DFSClient.java:2483) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2119) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2150) at java.io.DataInputStream.read(DataInputStream.java:132) at java.io.DataInputStream.readFully(DataInputStream.java:178) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getNextLogLine(HLogSplitter.java:764) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:402) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:351) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:266) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) 2011-12-20 03:05:09,216 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.12.38:51010, add to deadNodes and continue java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) ... 
2011-12-20 03:05:09,220 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.14.38:51010, add to deadNodes and continue java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) ... 2011-12-20 03:05:09,223 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2118163224139708562_43382 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry... {quote} SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 I hope I didn't break the spacetime continuum; I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task
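The pattern being described, reduced to a sketch (Node and its read() are hypothetical stand-ins, not the real DFSClient types): the ClosedByInterruptException is caught as just another IOException, the replica is blacklisted, and the loop retries, so the interrupt never actually stops the worker.

{code}
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of an interrupt-swallowing retry loop like the one in the traces:
// every IOException, including ClosedByInterruptException, just marks the
// node dead and the read is retried.
final class RetrySketch {
  interface Node { byte[] read() throws IOException; }

  static byte[] readWithRetries(List<Node> replicas) {
    Set<Node> deadNodes = new HashSet<>();
    while (true) {
      for (Node n : replicas) {
        if (deadNodes.contains(n)) continue;
        try {
          return n.read();
        } catch (IOException e) {   // swallows ClosedByInterruptException too
          deadNodes.add(n);         // "add to deadNodes and continue"
        }
      }
      deadNodes.clear();            // "get new block locations ... and retry"
    }
  }
}
{code}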
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173601#comment-13173601 ] stack commented on HBASE-5078: -- I'll make changes, lads. Thanks for the feedback. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phabricator updated HBASE-5021: --- Attachment: D849.3.patch nspiegelberg updated the revision [jira] [HBase-5021] Enforce upper bound on timestamp. Reviewers: Kannan, Liyin, JIRA Talked with Kannan about the latest timestamp bug. My last iteration didn't fix the original issue. Fixing and adding it to the unit test. REVISION DETAIL https://reviews.facebook.net/D849 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/HConstants.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: D849.1.patch, D849.2.patch, D849.3.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted by the application. We are working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Since timestamps are considered a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity check the upper bound on the server-side in HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
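The kind of server-side guard HBASE-5021 adds, in sketch form (the one-hour slop value and all names here are illustrative, not the patch's actual constants):

{code}
import java.io.IOException;

// Sketch: reject writes whose timestamp is implausibly far in the future.
// The slop value is illustrative, not HBase's actual default.
final class TimestampGuard {
  static final long MAX_FUTURE_SLOP_MS = 60L * 60L * 1000L;  // one hour

  static void sanityCheckTimestamp(long ts) throws IOException {
    long upperBound = System.currentTimeMillis() + MAX_FUTURE_SLOP_MS;
    if (ts > upperBound) {
      throw new IOException("timestamp " + ts + " exceeds upper bound " + upperBound);
    }
  }
}
{code}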
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173610#comment-13173610 ] Phabricator commented on HBASE-5021: Kannan has accepted the revision [jira] [HBase-5021] Enforce upper bound on timestamp. looks good! REVISION DETAIL https://reviews.facebook.net/D849 Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: D849.1.patch, D849.2.patch, D849.3.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted by the application. We are working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Since timestamps are considered a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity check the upper bound on the server-side in HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Status: Open (was: Patch Available) DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Attachment: 5078-v2.txt How is this? Addresses Ted's and J-D's comments. DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Status: Patch Available (was: Open) DistributedLogSplitter failing to split file because it has edits for lots of regions - Key: HBASE-5078 URL: https://issues.apache.org/jira/browse/HBASE-5078 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 5078-v2.txt, 5078.txt Testing 0.92.0RC, ran into an interesting issue where a log file had edits for many regions and just opening a file per region was taking so long that we were never updating our progress, and so the split of the log just kept failing; in this case, the first 40 edits in a file required our opening 35 files -- opening 35 files took longer than the hard-coded 25 seconds it's supposed to take acquiring the task. First, here is the master's view: {code} 2011-12-20 17:54:09,184 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 ver = 0 ... 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 acquired by sv4r27s44,7003,1324365396664 ... 2011-12-20 17:54:35,475 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403573033 ver = 3 {code} Master then gives it elsewhere. Over on the regionserver we see: {code} 2011-12-20 17:54:09,233 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker sv4r27s44,7003,1324365396664 acquired task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 2011-12-20 17:54:10,714 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://sv4r11s38:7000/hbase/splitlog/sv4r27s44,7003,1324365396664_hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679/TestTable/6b6bfc2716dff952435ab26f018648b2/recovered.edits/0278862.temp, syncFs=true, hflush=false {code} and so on till: {code} 2011-12-20 17:54:36,876 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 preempted from sv4r27s44,7003,1324365396664, current task state and owner=owned sv4r28s44,7003,1324365396678 2011-12-20 17:54:37,112 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: Failed to heartbeat the task/hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403487679 {code} When the above happened, we'd only processed 40 edits. As written, we only heartbeat every 1024 edits. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
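To see why that failed, consider a minimal sketch of the reporting loop described above (all names and values here are illustrative; this is not the actual SplitLogWorker code): progress is only reported once per fixed batch of edits, so many slow file opens before the first batch completes mean no heartbeat at all, and the 25-second acquisition timeout fires first.
{code}
/**
 * Illustrative sketch of the stall in HBASE-5078: progress is reported
 * only once per EDITS_PER_HEARTBEAT edits, so a long stretch of slow
 * per-region file opens before edit #1024 means no heartbeat at all.
 * All names and values are hypothetical, not the real SplitLogWorker API.
 */
public class HeartbeatStallSketch {
  static final int EDITS_PER_HEARTBEAT = 1024; // hard-coded batch size
  static final long TASK_TIMEOUT_MS = 25000L;  // master's acquisition timeout

  public static void main(String[] args) {
    long lastReport = System.currentTimeMillis();
    int editsProcessed = 0;
    for (int edit = 0; edit < 40; edit++) { // only 40 edits were processed
      simulateSlowRegionFileOpen();         // ~35 slow opens dominated the time
      editsProcessed++;
      if (editsProcessed % EDITS_PER_HEARTBEAT == 0) {
        lastReport = System.currentTimeMillis(); // never reached for 40 edits
      }
      if (System.currentTimeMillis() - lastReport > TASK_TIMEOUT_MS) {
        System.out.println("master preempts the task: no progress reported");
        return;
      }
    }
  }

  static void simulateSlowRegionFileOpen() {
    try { Thread.sleep(800); } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}
With roughly 800 ms per simulated open, the sketch blows past the 25-second timeout within the first few dozen edits without ever reporting progress, which matches the preemption seen in the logs above.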
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173638#comment-13173638 ] Jean-Daniel Cryans commented on HBASE-5078: --- Oh also don't bother with progress_failed, I'm going to remove it in HBASE-5077. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173637#comment-13173637 ] Jean-Daniel Cryans commented on HBASE-5078: --- I don't see where everyNopenedFiles is defined in that patch, and I can't find it in my file. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mubarak Seyed updated HBASE-4720: - Attachment: HBASE-4720.trunk.v1.patch The attached file (HBASE-4720.trunk.v1.patch) contains the changes after rebasing on TRUNK. Thanks. Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server Key: HBASE-4720 URL: https://issues.apache.org/jira/browse/HBASE-4720 Project: HBase Issue Type: Improvement Reporter: Daniel Lord Assignee: Mubarak Seyed Fix For: 0.94.0 Attachments: HBASE-4720.trunk.v1.patch, HBASE-4720.trunk.v1.patch, HBASE-4720.v1.patch, HBASE-4720.v3.patch I have several large application/HBase clusters where an application node will occasionally need to talk to HBase from a different cluster. In order to help ensure some of my consistency guarantees, I have a sentinel table that is updated atomically as users interact with the system. This works quite well for the regular HBase client, but the REST client does not implement the checkAndPut and checkAndDelete operations. This exposes the application to some race conditions that have to be worked around. It would be ideal if the same checkAndPut/checkAndDelete operations could be supported by the REST client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
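For readers unfamiliar with the operation being requested, this is the semantics of checkAndPut in the regular Java client of this era, which the REST server would need to mirror (the table, family, and values below are made-up examples, not part of the patch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "sentinel"); // hypothetical table name
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("state"), Bytes.toBytes("locked"));
    // Atomically apply the Put only if f:state still holds "unlocked";
    // returns false if another client won the race.
    boolean applied = table.checkAndPut(Bytes.toBytes("row1"),
        Bytes.toBytes("f"), Bytes.toBytes("state"),
        Bytes.toBytes("unlocked"), put);
    System.out.println(applied ? "sentinel updated" : "lost the race");
    table.close();
  }
}
{code}
checkAndDelete is the same shape with a Delete instead of a Put; the race conditions mentioned in the description come from emulating this compare-and-set with separate GET and PUT calls over REST.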
[jira] [Updated] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HBASE-5021: --- Status: Patch Available (was: Open) Enforce upper bound on timestamp Key: HBASE-5021 URL: https://issues.apache.org/jira/browse/HBASE-5021 Project: HBase Issue Type: Improvement Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Critical Fix For: 0.94.0 Attachments: D849.1.patch, D849.2.patch, D849.3.patch, HBASE-5021-trunk.patch We have been getting hit with performance problems on our time-series database due to invalid timestamps being inserted into it. We are working on adding proper checks to the app server, but production performance could be severely impacted, with significant recovery time, if something slips past. Since timestamps are a fundamental part of the HBase schema and multiple optimizations use timestamp information, we should allow the option to sanity-check the upper bound on the server side in HBase. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
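As a sketch of the idea (not the committed patch, which is in D849; the slop constant and names below are invented for illustration), the server-side check amounts to rejecting any write whose timestamp is further in the future than a configured bound:
{code}
/**
 * Illustrative upper-bound check on a write timestamp, in the spirit of
 * HBASE-5021. Not the committed implementation; the slop value is made up.
 */
public class TimestampSanityCheck {
  /** Maximum allowed clock skew into the future, e.g. one hour. */
  static final long MAX_CLOCKSKEW_MS = 60L * 60L * 1000L;

  static void checkTimestamp(long ts) {
    long now = System.currentTimeMillis();
    if (ts > now + MAX_CLOCKSKEW_MS) {
      throw new IllegalArgumentException("Timestamp " + ts
          + " is more than " + MAX_CLOCKSKEW_MS + " ms in the future");
    }
  }

  public static void main(String[] args) {
    checkTimestamp(System.currentTimeMillis());                        // passes
    checkTimestamp(System.currentTimeMillis() + 2 * MAX_CLOCKSKEW_MS); // throws
  }
}
{code}
Rejecting at write time is what protects the timestamp-based optimizations mentioned above: a far-future cell would otherwise shadow every subsequent legitimate write to that column.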
[jira] [Updated] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HBASE-5021: --- Resolution: Fixed Status: Resolved (was: Patch Available) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173643#comment-13173643 ] Zhihong Yu commented on HBASE-5078: --- How about renaming everyNopenedFiles to numOpenedFilesBeforeReporting? -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
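The rename above suggests the shape of the fix: count newly opened region files and report progress every N of them, rather than only every 1024 edits. A hedged sketch with hypothetical names (the real change is in the attached patches):
{code}
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of the fix direction for HBASE-5078: heartbeat after every N
 * newly opened region files, so slow opens alone keep the task alive.
 * All names and the reporting interval are illustrative.
 */
public class OpenedFileProgressSketch {
  static final int NUM_OPENED_FILES_BEFORE_REPORTING = 3; // illustrative value

  interface ProgressReporter { boolean progress(); }

  static boolean splitEdits(String[] editRegions, ProgressReporter reporter) {
    Set<String> openRegions = new HashSet<String>();
    for (String region : editRegions) {
      if (openRegions.add(region)) { // first edit for a region opens a file
        if (openRegions.size() % NUM_OPENED_FILES_BEFORE_REPORTING == 0
            && !reporter.progress()) {
          return false; // preempted: let go of the task cleanly
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    String[] edits = {"r1", "r2", "r3", "r4", "r5", "r6"};
    boolean done = splitEdits(edits, new ProgressReporter() {
      public boolean progress() { System.out.println("heartbeat"); return true; }
    });
    System.out.println("split finished: " + done);
  }
}
{code}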
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173648#comment-13173648 ] Phabricator commented on HBASE-5021: nspiegelberg has committed the revision [jira] [HBase-5021] Enforce upper bound on timestamp. REVISION DETAIL https://reviews.facebook.net/D849 COMMIT https://reviews.facebook.net/rHBASE1221532 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173650#comment-13173650 ] Zhihong Yu commented on HBASE-4720: --- {code} [ERROR] /Users/zhihyu/trunk-hbase/src/main/java/org/apache/hadoop/hbase/rest/RootResource.java:[108,9] cannot find symbol [ERROR] symbol : class CheckAndPutTableResource [ERROR] location: class org.apache.hadoop.hbase.rest.RootResource [ERROR] [ERROR] /Users/zhihyu/trunk-hbase/src/main/java/org/apache/hadoop/hbase/rest/RootResource.java:[114,9] cannot find symbol [ERROR] symbol : class CheckAndDeleteTableResource [ERROR] location: class org.apache.hadoop.hbase.rest.RootResource [ERROR] {code} I think some new files were not added as part of the TRUNK patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5021) Enforce upper bound on timestamp
[ https://issues.apache.org/jira/browse/HBASE-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173656#comment-13173656 ] Jean-Daniel Cryans commented on HBASE-5021: --- Nicolas, as this changes some behaviors and adds a configuration option, would you mind adding a release note for this jira? Thanks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans reassigned HBASE-5077: - Assignee: Jean-Daniel Cryans SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 Attachments: HBASE-5077.patch I hope I didn't break the spacetime continuum, but I got this while testing 0.92.0: {quote} 2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165) at java.lang.Thread.run(Thread.java:662) {quote} I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit, and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mubarak Seyed updated HBASE-4720: - Attachment: HBASE-4720.trunk.v1.patch Sorry for the inconvenience, I forgot to do 'svn add' on the new files before generating the patch. The attached file contains the updated patch. Thanks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-5077: -- Attachment: HBASE-5077.patch Adds the missing return false (I saw it was already fixed in 0.89-fb) and removes progress_failed since it doesn't do anything. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
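To make the one-line fix concrete, here is a control-flow sketch (hypothetical names, not the real SplitLogWorker code): when marking the task done fails because the znode is already gone, endTask must return false so the caller stops acting as the owner, instead of hitting the FATAL path in the stack trace above.
{code}
import org.apache.zookeeper.KeeperException;

/**
 * Illustrative control flow only, with hypothetical names: endTask must
 * tell its caller when the task znode is already gone, otherwise the
 * worker behaves as if it still owned the task.
 */
public class EndTaskSketch {
  /** @return true iff the task was successfully marked done. */
  boolean endTask(String taskPath) {
    try {
      setZkData(taskPath, "done");
      return true;
    } catch (KeeperException.NoNodeException e) {
      // The master already reassigned or deleted the task; let go of it.
      return false; // <-- the missing "return false" the patch adds
    } catch (KeeperException e) {
      return false;
    }
  }

  void setZkData(String path, String data) throws KeeperException {
    // Stand-in for the ZooKeeper setData call; always fails here so the
    // sketch exercises the NoNodeException path.
    throw new KeeperException.NoNodeException(path);
  }

  public static void main(String[] args) {
    System.out.println("ended cleanly: "
        + new EndTaskSketch().endTask("/hbase/splitlog/some-task"));
  }
}
{code}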
[jira] [Updated] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-5077: -- Status: Patch Available (was: Open) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173664#comment-13173664 ] Zhihong Yu commented on HBASE-5077: --- Patch looks good. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-4720: -- Attachment: (was: HBASE-4720.trunk.v1.patch) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4720) Implement atomic update operations (checkAndPut, checkAndDelete) for REST client/server
[ https://issues.apache.org/jira/browse/HBASE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-4720: -- Attachment: (was: HBASE-4720.trunk.v1.patch) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Status: Patch Available (was: Open) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Status: Open (was: Patch Available) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5078: - Attachment: 5078-v3.txt Addresses Ted's comment and J-D's suggestion that I not use progress_failed. How's this? -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173679#comment-13173679 ] Jean-Daniel Cryans commented on HBASE-5078: --- +1 on v3. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-5078: -- Comment: was deleted (was: Oh also don't bother with progress_failed, I'm going to remove it in HBASE-5077.) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
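The failure mode above -- heartbeating only once per 1024 edits while per-region file opens blow past the 25-second acquisition window -- suggests driving the progress report by elapsed time as well as by edit count. Below is a minimal sketch of that idea, not the actual HBASE-5078 patch: the interval constant and the loop shape are assumptions, and the locally declared `CancelableProgressable` merely mirrors the HBase callback interface of the same name.
{code}
// A minimal sketch (not the HBASE-5078 patch) of time-based progress
// reporting in a log-split loop. CancelableProgressable mirrors the
// HBase callback of that name; REPORT_INTERVAL_MS is an assumed value.
import java.util.List;

interface CancelableProgressable {
  boolean progress(); // false means the task was preempted
}

class SplitProgressSketch {
  private static final long REPORT_INTERVAL_MS = 5_000;  // assumed value
  private static final int REPORT_EVERY_N_EDITS = 1024;  // the old batch size

  boolean splitEdits(List<Object> edits, CancelableProgressable reporter) {
    long lastReport = System.currentTimeMillis();
    int processed = 0;
    for (Object edit : edits) {
      applyEdit(edit); // may open a new recovered.edits file -- slow
      processed++;
      long now = System.currentTimeMillis();
      // Old behavior: heartbeat only every 1024 edits. This sketch also
      // heartbeats whenever the interval elapses, so a run of slow file
      // opens cannot starve the master of progress reports.
      if (processed % REPORT_EVERY_N_EDITS == 0
          || now - lastReport >= REPORT_INTERVAL_MS) {
        if (!reporter.progress()) {
          return false; // preempted: stop splitting this log
        }
        lastReport = now;
      }
    }
    return true;
  }

  private void applyEdit(Object edit) {
    // elided: route the edit to the per-region recovered.edits writer
  }
}
{code}
With a time-based check like this, a stretch of slow file opens can delay the heartbeat by at most one edit's processing time, rather than by up to 1024 edits.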
[jira] [Commented] (HBASE-5078) DistributedLogSplitter failing to split file because it has edits for lots of regions
[ https://issues.apache.org/jira/browse/HBASE-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173678#comment-13173678 ] Hadoop QA commented on HBASE-5078: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508154/5078-v2.txt against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The patch appears to cause mvn compile goal to fail.
-1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/557//testReport/ Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/557//console This message is automatically generated. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5077) SplitLogWorker fails to let go of a task, kills the RS
[ https://issues.apache.org/jira/browse/HBASE-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173682#comment-13173682 ] stack commented on HBASE-5077: -- Chatting w/ J-D: we shouldn't return out of the middle of the finally block -- we should go through to its end so the file closes run. SplitLogWorker fails to let go of a task, kills the RS -- Key: HBASE-5077 URL: https://issues.apache.org/jira/browse/HBASE-5077 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Fix For: 0.92.1 Attachments: HBASE-5077.patch
I hope I didn't break the spacetime continuum; I got this while testing 0.92.0:
{quote}
2011-12-20 03:06:19,838 FATAL org.apache.hadoop.hbase.regionserver.SplitLogWorker: logic error - end task /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814 done failed because task doesn't exist
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A9100%2Fhbase%2F.logs%2Fsv4r14s38%2C62023%2C1324345935047-splitting%2Fsv4r14s38%252C62023%252C1324345935047.1324349363814
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:654)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.endTask(SplitLogWorker.java:372)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
at java.lang.Thread.run(Thread.java:662)
{quote}
I'll post more logs in a moment. What I can see is that the master shuffled that task around a bit, and one of the region servers died on this stack trace while the others were able to interrupt themselves. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
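stack's point about the finally block is worth illustrating: a return from the middle of a finally block silently discards any exception thrown in the try body and skips whatever cleanup follows it. Below is a minimal sketch of the two shapes being discussed, with illustrative names only -- this is not the actual SplitLogWorker code.
{code}
// Illustrative names only -- not the SplitLogWorker code. Shows why a
// return from the middle of a finally block is dangerous: it discards
// any in-flight exception and skips the cleanup below it.
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

class FinallyReturnSketch {
  // Anti-pattern: the early return swallows exceptions from doSplit()
  // and leaves every output file open.
  boolean splitBad(List<Closeable> outputs) throws IOException {
    try {
      return doSplit();
    } finally {
      if (preempted()) {
        return false; // BAD: skips the closes, eats exceptions
      }
      for (Closeable out : outputs) {
        out.close();
      }
    }
  }

  // Preferred shape: let finally run to its end so the closes happen,
  // then decide the return value once at the bottom.
  boolean splitGood(List<Closeable> outputs) throws IOException {
    boolean result;
    try {
      result = doSplit();
    } finally {
      for (Closeable out : outputs) {
        out.close(); // close the outputs before returning
      }
    }
    return result && !preempted();
  }

  private boolean doSplit() { return true; }
  private boolean preempted() { return false; }
}
{code}
Falling through to the end of finally guarantees the output files are closed whether the split succeeded, failed, or was preempted, which appears to be the behavior J-D and stack are after.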