[jira] [Updated] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anoop Sam John updated HBASE-13871: --- Attachment: HBASE-13871.patch Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch, HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
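The shape of the change in this issue's title — passing a Cell instead of a loose (byte[], int, int) triple — can be sketched as below. This is a minimal illustration, not HBase's actual RegionScannerImpl code; the class and method names are hypothetical.

```java
// Sketch: replacing (byte[], int, int) row parameters with a Cell-style
// object that carries its own backing array, offset, and length.
public class CellApiSketch {

  // Minimal stand-in for the row accessors of org.apache.hadoop.hbase.Cell.
  interface Cell {
    byte[] getRowArray();
    int getRowOffset();
    short getRowLength();
  }

  // Old style: three loosely coupled parameters per row reference.
  static boolean rowMatches(byte[] row, int offset, int length,
                            byte[] other, int otherOffset, int otherLength) {
    if (length != otherLength) {
      return false;
    }
    for (int i = 0; i < length; i++) {
      if (row[offset + i] != other[otherOffset + i]) {
        return false;
      }
    }
    return true;
  }

  // New style: one Cell parameter; the triple travels as a unit, so callers
  // cannot pass a mismatched array/offset/length combination.
  static boolean rowMatches(Cell a, Cell b) {
    return rowMatches(a.getRowArray(), a.getRowOffset(), a.getRowLength(),
                      b.getRowArray(), b.getRowOffset(), b.getRowLength());
  }

  // Helper to wrap a whole byte[] as a row-only Cell.
  static Cell cellForRow(byte[] row) {
    return new Cell() {
      public byte[] getRowArray() { return row; }
      public int getRowOffset() { return 0; }
      public short getRowLength() { return (short) row.length; }
    };
  }
}
```

The second overload also makes it easier to later back a Cell by an off-heap buffer without touching every call site.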
[jira] [Commented] (HBASE-13848) Access InfoServer SSL passwords through Credential Provider API
[ https://issues.apache.org/jira/browse/HBASE-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578670#comment-14578670 ] Hadoop QA commented on HBASE-13848: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12738535/HBASE-13848.1.patch against master branch at commit 487e4aa74fcc6ef4201f6ffdcfd1a7169c754562. ATTACHMENT ID: 12738535 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. 
The patch failed these unit tests: org.apache.hadoop.hbase.TestRegionRebalancing Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14343//testReport/ Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14343//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14343//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14343//console This message is automatically generated. Access InfoServer SSL passwords through Credential Provider API -- Key: HBASE-13848 URL: https://issues.apache.org/jira/browse/HBASE-13848 Project: HBase Issue Type: Improvement Components: security Reporter: Sean Busbey Assignee: Sean Busbey Attachments: HBASE-13848.1.patch, HBASE-13848.1.patch HBASE-11810 took care of getting our SSL passwords out of the Hadoop Credential Provider API, but we also get several out of clear text configuration for the InfoServer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13855) NPE in PartitionedMobCompactor
[ https://issues.apache.org/jira/browse/HBASE-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578640#comment-14578640 ] ramkrishna.s.vasudevan commented on HBASE-13855: +1 for commit. NPE in PartitionedMobCompactor -- Key: HBASE-13855 URL: https://issues.apache.org/jira/browse/HBASE-13855 Project: HBase Issue Type: Sub-task Components: mob Affects Versions: hbase-11339 Reporter: Jingcheng Du Assignee: Jingcheng Du Fix For: hbase-11339 Attachments: HBASE-13855.diff In PartitionedMobCompactor, mob files are split into partitions, the compactions of partitions run in parallel. The partitions share the same set of del files. There might be race conditions when open readers of del store files in each partition which can cause NPE. In this patch, we will pre-create the reader for each del store file to avoid this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
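The fix described in the issue — pre-creating one reader per del file on a single thread before the partition compactions go parallel, so partitions only ever read a shared, already-open list — can be sketched as follows. All names here are hypothetical stand-ins, not the PartitionedMobCompactor code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PrecreatedReadersSketch {

  // Stand-in for a del-store-file reader; the real one wraps an HFile.
  static class DelFileReader {
    final String path;
    DelFileReader(String path) { this.path = path; } // "opens" the file
  }

  static List<String> compactPartitions(List<String> partitions, List<String> delFiles) {
    // Pre-create every del-file reader BEFORE submitting parallel work.
    // The racy variant the bug report describes would instead open readers
    // lazily inside each partition task, letting two tasks race on the
    // shared del files (and NPE).
    List<DelFileReader> readers = new ArrayList<>();
    for (String f : delFiles) {
      readers.add(new DelFileReader(f));
    }
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<String>> results = new ArrayList<>();
      for (String p : partitions) {
        // Every task gets the same read-only reader list.
        results.add(pool.submit(() -> compactOne(p, readers)));
      }
      List<String> out = new ArrayList<>();
      for (Future<String> f : results) {
        out.add(f.get());
      }
      return out;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }

  static String compactOne(String partition, List<DelFileReader> sharedReaders) {
    // Only reads sharedReaders; never opens or closes them, so no race.
    return partition + ":compacted-with-" + sharedReaders.size() + "-del-readers";
  }
}
```

The design point is that reader creation (the racy step) is hoisted out of the parallel region; the tasks share immutable state only.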
[jira] [Commented] (HBASE-13836) Do not reset the mvcc for bulk loaded mob reference cells in reading
[ https://issues.apache.org/jira/browse/HBASE-13836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578660#comment-14578660 ] Anoop Sam John commented on HBASE-13836: -if (this.cur != null && this.reader.isBulkLoaded()) { +if (this.cur != null && !this.reader.isSkipResetSeqId()) { It will be better for readability that this check is this.cur != null && this.reader.isBulkLoaded() && !this.reader.isSkipResetSeqId() Also no need to call the setter on the reader for non bulk loaded cases.. Just set it only for bulk loaded files. And please add some comments on why we do this skip. Do not reset the mvcc for bulk loaded mob reference cells in reading Key: HBASE-13836 URL: https://issues.apache.org/jira/browse/HBASE-13836 Project: HBase Issue Type: Sub-task Components: mob Affects Versions: hbase-11339 Reporter: Jingcheng Du Assignee: Jingcheng Du Fix For: hbase-11339 Attachments: HBASE-13836.diff Now in scanner, the cells mvcc of the bulk loaded files are reset by the seqId parsed from the file name. We need to skip this if the hfiles are bulkloaded in mob compactions. In mob compaction, the bulk loaded ref cell might not be the latest cell among the ones that have the same row key. In reading, the mvcc is reset by the largest one, it will cover the real latest ref cell. We have to skip the resetting in this case. The solution is we add a new field to fileinfo, when this field is set as true, we skip the resetting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13836) Do not reset the mvcc for bulk loaded mob reference cells in reading
[ https://issues.apache.org/jira/browse/HBASE-13836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578649#comment-14578649 ] ramkrishna.s.vasudevan commented on HBASE-13836: {code} public static final byte[] SKIP_RESET_SEQ_ID = Bytes.toBytes("SKIP_RESET_SEQ_ID"); {code} Maybe you can add the reason and specify that it is used in MOB cases. It will help to understand later because it is specific to MOB. Also, can you always make this 'true' and avoid calling setSkipResetSeqId(true) every time for a non-bulk load case? If it is a bulk load you can set it based on the metadata info. {code} private boolean skipResetSeqId = false; {code} Do not reset the mvcc for bulk loaded mob reference cells in reading Key: HBASE-13836 URL: https://issues.apache.org/jira/browse/HBASE-13836 Project: HBase Issue Type: Sub-task Components: mob Affects Versions: hbase-11339 Reporter: Jingcheng Du Assignee: Jingcheng Du Fix For: hbase-11339 Attachments: HBASE-13836.diff Now in scanner, the cells mvcc of the bulk loaded files are reset by the seqId parsed from the file name. We need to skip this if the hfiles are bulkloaded in mob compactions. In mob compaction, the bulk loaded ref cell might not be the latest cell among the ones that have the same row key. In reading, the mvcc is reset by the largest one, it will cover the real latest ref cell. We have to skip the resetting in this case. The solution is we add a new field to fileinfo, when this field is set as true, we skip the resetting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13872) Typo in dfshealth.html - Decomissioning
nijel created HBASE-13872: - Summary: Typo in dfshealth.html - Decomissioning Key: HBASE-13872 URL: https://issues.apache.org/jira/browse/HBASE-13872 Project: HBase Issue Type: Bug Reporter: nijel Assignee: nijel Priority: Trivial <div class="page-header"><h1><small>Decomissioning</small></h1></div> change to <div class="page-header"><h1><small>Decommissioning</small></h1></div> in dfshealth.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13451) Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators
[ https://issues.apache.org/jira/browse/HBASE-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578616#comment-14578616 ] Hudson commented on HBASE-13451: FAILURE: Integrated in HBase-TRUNK #6554 (See [https://builds.apache.org/job/HBase-TRUNK/6554/]) HBASE-13451 - Make the HFileBlockIndex blockKeys to Cells so that it could (ramkrishna: rev 487e4aa74fcc6ef4201f6ffdcfd1a7169c754562) * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java * hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java * hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java * hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV3.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java * hbase-server/src/main/java/org/apache/hadoop/hbase/util/CompoundBloomFilterWriter.java * hbase-server/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java * hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java * hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java * hbase-server/src/main/java/org/apache/hadoop/hbase/util/BloomFilterChunk.java * hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionFileSystem.java * hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileSeek.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java * hbase-server/src/main/java/org/apache/hadoop/hbase/util/BloomFilterFactory.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java * hbase-server/src/main/java/org/apache/hadoop/hbase/util/CompoundBloomFilter.java * hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompoundBloomFilter.java * 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/CompoundBloomFilterBase.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderImpl.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CompoundBloomFilterWriter.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CompoundBloomFilter.java * hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java * hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CompoundBloomFilterBase.java Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators -- Key: HBASE-13451 URL: https://issues.apache.org/jira/browse/HBASE-13451 Project: HBase Issue Type: Sub-task Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0 Attachments: HBASE-13451.patch, HBASE-13451_1.patch, HBASE-13451_2.patch, HBASE-13451_3.patch, HBASE-13451_4.patch, HBASE-13451_5.patch, HBASE-13451_6.patch, HBASE-13451_7.patch After HBASE-10800 we could ensure that all the blockKeys in the BlockReader are converted to Cells (KeyOnlyKeyValue) so that we could use CellComparators. Note that this can be done only for the keys that are written using CellComparators and not for the ones using RawBytesComparator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13829) Add more ThrottleType
[ https://issues.apache.org/jira/browse/HBASE-13829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578769#comment-14578769 ] Matteo Bertozzi commented on HBASE-13829: - +1 on v2 Add more ThrottleType - Key: HBASE-13829 URL: https://issues.apache.org/jira/browse/HBASE-13829 Project: HBase Issue Type: Improvement Components: Client Reporter: Guanghao Zhang Assignee: Guanghao Zhang Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13829-v1.patch, HBASE-13829-v2.patch, HBASE-13829.patch HBASE-11598 added simple throttling for hbase, but in the client it doesn't let the user set a ThrottleType like WRITE_NUM, WRITE_SIZE, READ_NUM, READ_SIZE. REVIEW BOARD: https://reviews.apache.org/r/34989/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
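The throttle dimensions this issue exposes (request count vs. bytes, split by read/write) can be illustrated with a minimal per-interval quota check. This is a self-contained sketch of the idea only; it is not HBase's quota implementation, and the enum constants mirror the names quoted in the description rather than any confirmed API.

```java
public class ThrottleTypeSketch {

  // Illustrative throttle dimensions, after the names in the issue text.
  enum ThrottleType { WRITE_NUM, WRITE_SIZE, READ_NUM, READ_SIZE }

  // A limit for one throttle type over one refresh interval.
  static class Quota {
    final ThrottleType type;
    final long limitPerInterval;
    long usedThisInterval = 0;

    Quota(ThrottleType type, long limitPerInterval) {
      this.type = type;
      this.limitPerInterval = limitPerInterval;
    }

    // Records the operation and returns true if it fits under the limit;
    // amount is a request count for *_NUM types, bytes for *_SIZE types.
    boolean tryConsume(long amount) {
      if (usedThisInterval + amount > limitPerInterval) {
        return false;
      }
      usedThisInterval += amount;
      return true;
    }
  }
}
```

Splitting the type out this way is what lets a client throttle, say, write volume without also capping reads.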
[jira] [Commented] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578773#comment-14578773 ] ramkrishna.s.vasudevan commented on HBASE-13871: LGTM. nit, createFirstOnRow can have a small doc. Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch, HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578790#comment-14578790 ] Hadoop QA commented on HBASE-13871: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12738544/HBASE-13871.patch against master branch at commit 487e4aa74fcc6ef4201f6ffdcfd1a7169c754562. ATTACHMENT ID: 12738544 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14344//testReport/ Release Findbugs (version 2.0.3)warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14344//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14344//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14344//console This message is automatically generated. Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch, HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13755) Provide single super user check implementation
[ https://issues.apache.org/jira/browse/HBASE-13755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578392#comment-14578392 ] Hadoop QA commented on HBASE-13755: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12738491/HBASE-13755-v4.patch against master branch at commit c62b396f9f4537d121e2661f328674c99a8b7fb7. ATTACHMENT ID: 12738491 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 16 new or modified tests. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14341//testReport/ Release Findbugs (version 2.0.3)warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14341//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14341//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14341//console This message is automatically generated. Provide single super user check implementation -- Key: HBASE-13755 URL: https://issues.apache.org/jira/browse/HBASE-13755 Project: HBase Issue Type: Improvement Reporter: Anoop Sam John Assignee: Mikhail Antonov Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13755-v1.patch, HBASE-13755-v2.patch, HBASE-13755-v3.patch, HBASE-13755-v3.patch, HBASE-13755-v3.patch, HBASE-13755-v3.patch, HBASE-13755-v3.patch, HBASE-13755-v4.patch Followup for HBASE-13375. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13870) Aborted major compaction does not clean up after itself
Gautam Gopalakrishnan created HBASE-13870: - Summary: Aborted major compaction does not clean up after itself Key: HBASE-13870 URL: https://issues.apache.org/jira/browse/HBASE-13870 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 0.98.10 Reporter: Gautam Gopalakrishnan Priority: Minor When a major compaction is aborted, incomplete HFiles can be left behind in the .tmp directory of the region. They are cleaned up when the region is reassigned but in long running clusters, this does not happen often leading to excess disk usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13847) getWriteRequestCount function in HRegionServer uses int variable to return the count.
[ https://issues.apache.org/jira/browse/HBASE-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578388#comment-14578388 ] Hudson commented on HBASE-13847: FAILURE: Integrated in HBase-1.1 #531 (See [https://builds.apache.org/job/HBase-1.1/531/]) HBASE-13847 getWriteRequestCount function in HRegionServer uses int variable to return the count (Abhilash) (stack: rev 4afad40d5a935b5e5de2cd65ad7ffc8a28d1e893) * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java getWriteRequestCount function in HRegionServer uses int variable to return the count. - Key: HBASE-13847 URL: https://issues.apache.org/jira/browse/HBASE-13847 Project: HBase Issue Type: Bug Components: hbase, regionserver Affects Versions: 1.0.0 Reporter: Abhilash Assignee: Abhilash Labels: easyfix Fix For: 2.0.0, 1.0.1, 1.2.0, 1.1.1 Attachments: HBASE-13847.patch, HBASE-13847.patch, HBASE-13847.patch, HBASE-13847.patch, screenshot-1.png Variable used to return the value of getWriteRequestCount is int, must be long. I think it causes cluster UI to show negative Write Request Count. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
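The bug in the quoted description — accumulating long per-region counters into an int — silently overflows once the total passes Integer.MAX_VALUE, which is exactly what makes the UI show a negative Write Request Count. A self-contained illustration (not the actual HRegionServer code):

```java
public class IntOverflowDemo {

  // Buggy shape: int accumulator for long per-region counters. The compound
  // assignment compiles because it narrows implicitly, hiding the overflow.
  static int totalAsInt(long[] perRegionWriteRequests) {
    int writeCount = 0;
    for (long c : perRegionWriteRequests) {
      writeCount += c; // wraps negative past Integer.MAX_VALUE (2,147,483,647)
    }
    return writeCount;
  }

  // Fixed shape: keep the accumulator a long, matching the counter type.
  static long totalAsLong(long[] perRegionWriteRequests) {
    long writeCount = 0;
    for (long c : perRegionWriteRequests) {
      writeCount += c;
    }
    return writeCount;
  }
}
```

Two regions with 1.5 billion writes each are already enough to push the int variant negative while the long variant reports 3 billion correctly.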
[jira] [Commented] (HBASE-13847) getWriteRequestCount function in HRegionServer uses int variable to return the count.
[ https://issues.apache.org/jira/browse/HBASE-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578382#comment-14578382 ] Hudson commented on HBASE-13847: FAILURE: Integrated in HBase-TRUNK #6553 (See [https://builds.apache.org/job/HBase-TRUNK/6553/]) HBASE-13847 getWriteRequestCount function in HRegionServer uses int variable to return the count (Abhilash) (stack: rev c62b396f9f4537d121e2661f328674c99a8b7fb7) * hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java getWriteRequestCount function in HRegionServer uses int variable to return the count. - Key: HBASE-13847 URL: https://issues.apache.org/jira/browse/HBASE-13847 Project: HBase Issue Type: Bug Components: hbase, regionserver Affects Versions: 1.0.0 Reporter: Abhilash Assignee: Abhilash Labels: easyfix Fix For: 2.0.0, 1.0.1, 1.2.0, 1.1.1 Attachments: HBASE-13847.patch, HBASE-13847.patch, HBASE-13847.patch, HBASE-13847.patch, screenshot-1.png Variable used to return the value of getWriteRequestCount is int, must be long. I think it causes cluster UI to show negative Write Request Count. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
Anoop Sam John created HBASE-13871: -- Summary: Change RegionScannerImpl to deal with Cell instead of byte[], int, int Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Reporter: Anoop Sam John Assignee: Anoop Sam John This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13848) Access InfoServer SSL passwords through Credential Provider API
[ https://issues.apache.org/jira/browse/HBASE-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-13848: Status: Patch Available (was: Open) Access InfoServer SSL passwords through Credential Provider API -- Key: HBASE-13848 URL: https://issues.apache.org/jira/browse/HBASE-13848 Project: HBase Issue Type: Improvement Components: security Reporter: Sean Busbey Assignee: Sean Busbey Attachments: HBASE-13848.1.patch, HBASE-13848.1.patch HBASE-11810 took care of getting our SSL passwords out of the Hadoop Credential Provider API, but we also get several out of clear text configuration for the InfoServer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13855) NPE in PartitionedMobCompactor
[ https://issues.apache.org/jira/browse/HBASE-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578484#comment-14578484 ] Anoop Sam John commented on HBASE-13855: +1 [~jmhsieh] Pls check and commit ? NPE in PartitionedMobCompactor -- Key: HBASE-13855 URL: https://issues.apache.org/jira/browse/HBASE-13855 Project: HBase Issue Type: Sub-task Components: mob Affects Versions: hbase-11339 Reporter: Jingcheng Du Assignee: Jingcheng Du Fix For: hbase-11339 Attachments: HBASE-13855.diff In PartitionedMobCompactor, mob files are split into partitions, the compactions of partitions run in parallel. The partitions share the same set of del files. There might be race conditions when open readers of del store files in each partition which can cause NPE. In this patch, we will pre-create the reader for each del store file to avoid this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13848) Access InfoServer SSL passwords through Credential Provider API
[ https://issues.apache.org/jira/browse/HBASE-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-13848: Status: Open (was: Patch Available) Access InfoServer SSL passwords through Credential Provider API -- Key: HBASE-13848 URL: https://issues.apache.org/jira/browse/HBASE-13848 Project: HBase Issue Type: Improvement Components: security Reporter: Sean Busbey Assignee: Sean Busbey Attachments: HBASE-13848.1.patch HBASE-11810 took care of getting our SSL passwords out of the Hadoop Credential Provider API, but we also get several out of clear text configuration for the InfoServer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578492#comment-14578492 ] Hadoop QA commented on HBASE-13871: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12738516/HBASE-13871.patch against master branch at commit c62b396f9f4537d121e2661f328674c99a8b7fb7. ATTACHMENT ID: 12738516 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:red}-1 checkstyle{color}. The applied patch generated 1917 checkstyle errors (more than the master's current 1912 errors). {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. 
The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestMinVersions org.apache.hadoop.hbase.regionserver.TestKeepDeletes org.apache.hadoop.hbase.regionserver.TestScanWithBloomError org.apache.hadoop.hbase.regionserver.TestStoreFileRefresherChore org.apache.hadoop.hbase.regionserver.TestScanner org.apache.hadoop.hbase.filter.TestInvocationRecordFilter org.apache.hadoop.hbase.procedure.TestProcedureManager org.apache.hadoop.hbase.regionserver.TestResettingCounters Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14342//testReport/ Release Findbugs (version 2.0.3)warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14342//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14342//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14342//console This message is automatically generated. Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13848) Access InfoServer SSL passwords through Credential Provider API
[ https://issues.apache.org/jira/browse/HBASE-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated HBASE-13848: Attachment: HBASE-13848.1.patch reattaching for another run. based on conversation on HBASE-13811 I'm sure the failure in TestDistributedLogSplitting isn't related. Let's see if the fix from there helped. Access InfoServer SSL passwords through Credential Provider API -- Key: HBASE-13848 URL: https://issues.apache.org/jira/browse/HBASE-13848 Project: HBase Issue Type: Improvement Components: security Reporter: Sean Busbey Assignee: Sean Busbey Attachments: HBASE-13848.1.patch, HBASE-13848.1.patch HBASE-11810 took care of getting our SSL passwords out of the Hadoop Credential Provider API, but we also get several out of clear text configuration for the InfoServer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13870) Aborted major compaction does not clean up after itself
[ https://issues.apache.org/jira/browse/HBASE-13870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578409#comment-14578409 ] Anoop Sam John commented on HBASE-13870: Planning to give a patch? Aborted major compaction does not clean up after itself --- Key: HBASE-13870 URL: https://issues.apache.org/jira/browse/HBASE-13870 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 0.98.10 Reporter: Gautam Gopalakrishnan Priority: Minor When a major compaction is aborted, incomplete HFiles can be left behind in the .tmp directory of the region. They are cleaned up when the region is reassigned but in long running clusters, this does not happen often leading to excess disk usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13451) Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators
[ https://issues.apache.org/jira/browse/HBASE-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-13451: --- Attachment: HBASE-13451_7.patch Patch that was committed. Thanks for the reviews Ted, Stack and Anoop. Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators -- Key: HBASE-13451 URL: https://issues.apache.org/jira/browse/HBASE-13451 Project: HBase Issue Type: Sub-task Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0 Attachments: HBASE-13451.patch, HBASE-13451_1.patch, HBASE-13451_2.patch, HBASE-13451_3.patch, HBASE-13451_4.patch, HBASE-13451_5.patch, HBASE-13451_6.patch, HBASE-13451_7.patch After HBASE-10800 we could ensure that all the blockKeys in the BlockReader are converted to Cells (KeyOnlyKeyValue) so that we could use CellComparators. Note that this can be done only for the keys that are written using CellComparators and not for the ones using RawBytesComparator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13451) Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators
[ https://issues.apache.org/jira/browse/HBASE-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-13451: --- Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to master. Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators -- Key: HBASE-13451 URL: https://issues.apache.org/jira/browse/HBASE-13451 Project: HBase Issue Type: Sub-task Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0 Attachments: HBASE-13451.patch, HBASE-13451_1.patch, HBASE-13451_2.patch, HBASE-13451_3.patch, HBASE-13451_4.patch, HBASE-13451_5.patch, HBASE-13451_6.patch, HBASE-13451_7.patch After HBASE-10800 we could ensure that all the blockKeys in the BlockReader are converted to Cells (KeyOnlyKeyValue) so that we could use CellComparators. Note that this can be done only for the keys that are written using CellComparators and not for the ones using RawBytesComparator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13451) Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators
[ https://issues.apache.org/jira/browse/HBASE-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578417#comment-14578417 ] Anoop Sam John commented on HBASE-13451: Thanks [~ram_krish] for the nice work. It was a much needed cleanup, and we reduce a lot of short-lived object creations (KeyOnlyKeyValue). Make the HFileBlockIndex blockKeys to Cells so that it could be easy to use in the CellComparators -- Key: HBASE-13451 URL: https://issues.apache.org/jira/browse/HBASE-13451 Project: HBase Issue Type: Sub-task Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0 Attachments: HBASE-13451.patch, HBASE-13451_1.patch, HBASE-13451_2.patch, HBASE-13451_3.patch, HBASE-13451_4.patch, HBASE-13451_5.patch, HBASE-13451_6.patch, HBASE-13451_7.patch After HBASE-10800 we could ensure that all the blockKeys in the BlockReader are converted to Cells (KeyOnlyKeyValue) so that we could use CellComparators. Note that this can be done only for the keys that are written using CellComparators and not for the ones using RawBytesComparator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13836) Do not reset the mvcc for bulk loaded mob reference cells in reading
[ https://issues.apache.org/jira/browse/HBASE-13836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jingcheng Du updated HBASE-13836: - Attachment: HBASE-13836.diff Upload the patch. Please take a look, thanks a lot! Do not reset the mvcc for bulk loaded mob reference cells in reading Key: HBASE-13836 URL: https://issues.apache.org/jira/browse/HBASE-13836 Project: HBase Issue Type: Sub-task Components: mob Affects Versions: hbase-11339 Reporter: Jingcheng Du Assignee: Jingcheng Du Fix For: hbase-11339 Attachments: HBASE-13836.diff Now in the scanner, the mvcc of cells from bulk loaded files is reset to the seqId parsed from the file name. We need to skip this if the hfiles were bulk loaded by mob compactions. In a mob compaction, the bulk loaded ref cell might not be the latest cell among the ones that have the same row key. In reading, the mvcc is reset to the largest one, which would make it shadow the real latest ref cell. We have to skip the resetting in this case. The solution is to add a new field to fileinfo; when this field is set to true, we skip the resetting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
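The proposed behavior can be sketched abstractly: whether a cell keeps its own mvcc or takes the bulk-load sequence id depends on a per-file flag. This is a simplified illustration with invented names, not the actual HBase fileinfo API:

```java
public class MobSeqIdDemo {

    // Simplified model of the proposal above (names are invented): a bulk
    // loaded file normally forces its cells' mvcc/seqId to the seqId parsed
    // from the file name, but a file written by a mob compaction sets a
    // fileinfo flag that skips the reset, preserving each ref cell's
    // original mvcc.
    static long effectiveSeqId(long cellSeqId, long bulkLoadSeqId,
                               boolean skipResetSeqId) {
        return skipResetSeqId ? cellSeqId : bulkLoadSeqId;
    }

    public static void main(String[] args) {
        // A ref cell originally written at mvcc 7, in a file bulk loaded
        // with seqId 42:
        System.out.println(effectiveSeqId(7L, 42L, false)); // normal bulk load: 42
        System.out.println(effectiveSeqId(7L, 42L, true));  // mob compaction: 7
    }
}
```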
[jira] [Commented] (HBASE-13865) Default value of hbase.hregion.memstore.block.multiplier in HBase book is wrong
[ https://issues.apache.org/jira/browse/HBASE-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578374#comment-14578374 ] Vladimir Rodionov commented on HBASE-13865: --- It does, but I would move default (4) to HConstants. Default value of hbase.hregion.memstore.block.multiplier in HBase book is wrong --- Key: HBASE-13865 URL: https://issues.apache.org/jira/browse/HBASE-13865 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 2.0.0 Reporter: Vladimir Rodionov Priority: Trivial Attachments: HBASE-13865.1.patch Its 4 in the book and 2 in a current master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anoop Sam John updated HBASE-13871: --- Fix Version/s: 2.0.0 Status: Patch Available (was: Open) Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anoop Sam John updated HBASE-13871: --- Attachment: HBASE-13871.patch Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13840) Master UI should rename column labels from KVs to Cell
[ https://issues.apache.org/jira/browse/HBASE-13840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars George updated HBASE-13840: Component/s: regionserver Master UI should rename column labels from KVs to Cell -- Key: HBASE-13840 URL: https://issues.apache.org/jira/browse/HBASE-13840 Project: HBase Issue Type: Bug Components: master, regionserver, UI Affects Versions: 1.1.0 Reporter: Lars George Fix For: 2.0.0, 1.2.0 Currently the master UI still refers to KVs in some of the tables. We should do a sweep and rename to Cell. Also do for RS templates. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13840) Master UI should rename column labels from KVs to Cell
[ https://issues.apache.org/jira/browse/HBASE-13840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578977#comment-14578977 ] Lars George commented on HBASE-13840: - This also applies to the region server UI. I will rename the issue. Master UI should rename column labels from KVs to Cell -- Key: HBASE-13840 URL: https://issues.apache.org/jira/browse/HBASE-13840 Project: HBase Issue Type: Bug Components: master, regionserver, UI Affects Versions: 1.1.0 Reporter: Lars George Fix For: 2.0.0, 1.2.0 Currently the master UI still refers to KVs in some of the tables. We should do a sweep and rename to Cell. Also do for RS templates. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13873) LoadTestTool addAuthInfoToConf throws UnsupportedOperationException
sunyerui created HBASE-13873: Summary: LoadTestTool addAuthInfoToConf throws UnsupportedOperationException Key: HBASE-13873 URL: https://issues.apache.org/jira/browse/HBASE-13873 Project: HBase Issue Type: Bug Components: integration tests Affects Versions: 0.98.13 Reporter: sunyerui Fix For: 0.98.14 When running IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool is invoked and throws UnsupportedOperationException, with the stack trace as follows: {panel} 2015-06-09 22:15:33,605 ERROR \[main\] util.AbstractHBaseTool: Error running command-line tool java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811) at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516) at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151) at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114) at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97) at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136) {panel} The corresponding code is below and the reason is obvious: Arrays.asList returns a java.util.Arrays$ArrayList, not a java.util.ArrayList. 
Both inherit from java.util.AbstractList, but the former does not override add(), so the parent method java.util.AbstractList.add() is invoked and the exception is thrown. {code} private void addAuthInfoToConf(Properties authConfig, Configuration conf, String owner, String userList) throws IOException { List<String> users = Arrays.asList(userList.split(",")); users.add(owner); ... } {code} Has anyone else run into this? I think it's an obvious bug, but no one has reported it, so please tell me if I am misunderstanding it. If it is indeed a bug, it can be fixed very easily as below: {code} List<String> users = new ArrayList<String>(Arrays.asList(userList.split(","))); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
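The fixed-size behavior of Arrays.asList and the proposed fix can be demonstrated in isolation. This is a standalone sketch, not the LoadTestTool code itself, and the user names are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeListDemo {

    // The fix suggested in the report: copy the fixed-size view returned by
    // Arrays.asList into a mutable ArrayList before calling add().
    static List<String> addOwner(String userList, String owner) {
        List<String> users = new ArrayList<>(Arrays.asList(userList.split(",")));
        users.add(owner);
        return users;
    }

    public static void main(String[] args) {
        // Arrays.asList returns java.util.Arrays$ArrayList, a fixed-size view
        // backed by the source array; add() falls through to
        // AbstractList.add(), which throws UnsupportedOperationException.
        List<String> fixed = Arrays.asList("alice,bob".split(","));
        try {
            fixed.add("owner");
        } catch (UnsupportedOperationException e) {
            System.out.println("fixed-size list rejected add()");
        }
        System.out.println(addOwner("alice,bob", "owner")); // [alice, bob, owner]
    }
}
```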
[jira] [Commented] (HBASE-13873) LoadTestTool addAuthInfoToConf throws UnsupportedOperationException
[ https://issues.apache.org/jira/browse/HBASE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579114#comment-14579114 ] Anoop Sam John commented on HBASE-13873: Planning to put a patch? LoadTestTool addAuthInfoToConf throws UnsupportedOperationException --- Key: HBASE-13873 URL: https://issues.apache.org/jira/browse/HBASE-13873 Project: HBase Issue Type: Bug Components: integration tests Affects Versions: 0.98.13 Reporter: sunyerui Fix For: 0.98.14 When run IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool will be invoked and throws UnsupportedOperationException, stack as follows: {code} 2015-06-09 22:15:33,605 ERROR [main] util.AbstractHBaseTool: Error running command-line tool java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811) at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516) at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151) at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114) at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97) at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136) {code} The corresponding code is below and the reason is obvious. 
Arrays.asList returns a java.util.Arrays$ArrayList, not a java.util.ArrayList. Both inherit from java.util.AbstractList, but the former does not override add(), so the parent method java.util.AbstractList.add() is invoked and the exception is thrown. {code} private void addAuthInfoToConf(Properties authConfig, Configuration conf, String owner, String userList) throws IOException { List<String> users = Arrays.asList(userList.split(",")); users.add(owner); ... } {code} Has anyone else run into this? I think it's an obvious bug, but no one has reported it, so please tell me if I am misunderstanding it. If it is indeed a bug, it can be fixed very easily as below: {code} List<String> users = new ArrayList<String>(Arrays.asList(userList.split(","))); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13873) LoadTestTool addAuthInfoToConf throws UnsupportedOperationException
[ https://issues.apache.org/jira/browse/HBASE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sunyerui updated HBASE-13873: - Description: When run IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool will be invoked and throws UnsupportedOperationException, stack as follows: {code} 2015-06-09 22:15:33,605 ERROR [main] util.AbstractHBaseTool: Error running command-line tool java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811) at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516) at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151) at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114) at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97) at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136) {code} The corresponding code is below and the reason is obvious. Arrays.asList return a java.util.Arrays$ArrayList but not java.util.ArrayList. Both of them are inherited from java.util.AbstractList, but the former didn't override the method add(), so the parent method java.util.AbstractList.add() will be invoked and the exception threw. 
{code} private void addAuthInfoToConf(Properties authConfig, Configuration conf, String owner, String userList) throws IOException { List<String> users = Arrays.asList(userList.split(",")); users.add(owner); ... } {code} Has anyone else run into this? I think it's an obvious bug, but no one has reported it, so please tell me if I am misunderstanding it. If it is indeed a bug, it can be fixed very easily as below: {code} List<String> users = new ArrayList<String>(Arrays.asList(userList.split(","))); {code} was: When running IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool is invoked and throws UnsupportedOperationException, with the stack trace as follows: {panel} 2015-06-09 22:15:33,605 ERROR \[main\] util.AbstractHBaseTool: Error running command-line tool java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811) at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516) at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151) at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114) at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97) at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136) {panel} The corresponding code is below and the reason is obvious. 
Arrays.asList returns a java.util.Arrays$ArrayList, not a java.util.ArrayList. Both inherit from java.util.AbstractList, but the former does not override add(), so the parent method java.util.AbstractList.add() is invoked and the exception is thrown. {code} private void addAuthInfoToConf(Properties authConfig, Configuration conf, String owner, String userList) throws IOException { List<String> users = Arrays.asList(userList.split(",")); users.add(owner); ... } {code} Has anyone else run into this? I think it's an obvious bug, but no one has reported it, so please tell me if I am misunderstanding it. If it is indeed a bug, it can be fixed very easily as below: {code} List<String> users = new ArrayList<String>(Arrays.asList(userList.split(","))); {code}
[jira] [Updated] (HBASE-13840) Server UIs should rename column labels from KVs to Cell
[ https://issues.apache.org/jira/browse/HBASE-13840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars George updated HBASE-13840: Summary: Server UIs should rename column labels from KVs to Cell (was: Master UI should rename column labels from KVs to Cell) Server UIs should rename column labels from KVs to Cell --- Key: HBASE-13840 URL: https://issues.apache.org/jira/browse/HBASE-13840 Project: HBase Issue Type: Bug Components: master, regionserver, UI Affects Versions: 1.1.0 Reporter: Lars George Fix For: 2.0.0, 1.2.0 Currently the master UI still refers to KVs in some of the tables. We should do a sweep and rename to Cell. Also do for RS templates. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13873) LoadTestTool addAuthInfoToConf throws UnsupportedOperationException
[ https://issues.apache.org/jira/browse/HBASE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sunyerui updated HBASE-13873: - Description: When run IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool will be invoked and throws UnsupportedOperationException, stack as follows: {panel} 2015-06-09 22:15:33,605 ERROR \[main\] util.AbstractHBaseTool: Error running command-line tool java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811) at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516) at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151) at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114) at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97) at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136) {panel} The corresponding code is below and the reason is obvious. Arrays.asList return a java.util.Arrays$ArrayList but not java.util.ArrayList. Both of them are inherited from java.util.AbstractList, but the former didn't override the method add(), so the parent method java.util.AbstractList.add() will be invoked and the exception threw. 
{code} private void addAuthInfoToConf(Properties authConfig, Configuration conf, String owner, String userList) throws IOException { List<String> users = Arrays.asList(userList.split(",")); users.add(owner); ... } {code} Has anyone else run into this? I think it's an obvious bug, but no one has reported it, so please tell me if I am misunderstanding it. If it is indeed a bug, it can be fixed very easily as below: {code} List<String> users = new ArrayList<String>(Arrays.asList(userList.split(","))); {code} was: When running IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool is invoked and throws UnsupportedOperationException, with the stack trace as follows: {panel} 2015-06-09 22:15:33,605 ERROR \[main\] util.AbstractHBaseTool: Error running command-line tool java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811) at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516) at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151) at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114) at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97) at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115) at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136) {panel} The corresponding code is below and the reason is obvious. 
Arrays.asList returns a java.util.Arrays$ArrayList, not a java.util.ArrayList. Both inherit from java.util.AbstractList, but the former does not override add(), so the parent method java.util.AbstractList.add() is invoked and the exception is thrown. {code} private void addAuthInfoToConf(Properties authConfig, Configuration conf, String owner, String userList) throws IOException { List<String> users = Arrays.asList(userList.split(",")); users.add(owner); ... } {code} Has anyone else run into this? I think it's an obvious bug, but no one has reported it, so please tell me if I am misunderstanding it. If it is indeed a bug, it can be fixed very easily as below: {code} List<String> users = new ArrayList<String>(Arrays.asList(userList.split(","))); {code}
[jira] [Commented] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region
[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579205#comment-14579205 ] Ted Yu commented on HBASE-13329: lgtm Memstore flush fails if data has always the same value, breaking the region --- Key: HBASE-13329 URL: https://issues.apache.org/jira/browse/HBASE-13329 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.0.1 Environment: linux-debian-jessie ec2 - t2.micro instances Reporter: Ruben Aguiar Assignee: Ruben Aguiar Priority: Critical Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.1 Attachments: 13329-v1.patch While trying to benchmark my opentsdb cluster, I've created a script that sends to hbase always the same value (in this case 1). After a few minutes, the whole region server crashes and the region itself becomes impossible to open again (cannot assign or unassign). After some investigation, what I saw on the logs is that when a Memstore flush is called on a large region (128mb) the process errors, killing the regionserver. On restart, replaying the edits generates the same error, making the region unavailable. Tried to manually unassign, assign or close_region. That didn't work because the code that reads/replays it crashes. From my investigation this seems to be an overflow issue. The logs show that the function getMinimumMidpointArray tried to access index -32743 of an array, extremely close to the minimum short value in Java. Upon investigation of the source code, it seems an index short is used, being incremented as long as the two vectors are the same, probably making it overflow on large vectors with equal data. Changing it to int should solve the problem. Here follows the hadoop logs of when the regionserver went down. Any help is appreciated. 
Any other information you need please do tell me: 2015-03-24 18:00:56,187 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140 2015-03-24 18:00:56,188 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 2015-03-24 18:04:35,722 INFO [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2. 
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743 at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478) at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263) at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932) at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121) at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71) at
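The short-index overflow described in this report can be reproduced in isolation. The code below is an illustrative re-creation of the pattern, not the actual getMinimumMidpointArray code: a short loop index scanning the common prefix of two byte arrays wraps from 32767 to -32768, and the next array access fails with a negative index, much like the -32743 in the stack trace above.

```java
public class ShortIndexOverflowDemo {

    // Buggy pattern resembling the reported one: a short index is
    // incremented while the two arrays agree, and silently wraps negative
    // once the common prefix exceeds Short.MAX_VALUE (32767).
    static int commonPrefixShortIndex(byte[] a, byte[] b) {
        short i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) {
            i++; // wraps from 32767 to -32768; a[i] then throws
        }
        return i;
    }

    // The fix suggested in the report: an int index cannot overflow for any
    // array length Java allows.
    static int commonPrefixIntIndex(byte[] a, byte[] b) {
        int i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] a = new byte[40000]; // two identical all-zero arrays,
        byte[] b = new byte[40000]; // longer than Short.MAX_VALUE
        try {
            commonPrefixShortIndex(a, b);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("short index wrapped negative: " + e.getMessage());
        }
        System.out.println(commonPrefixIntIndex(a, b)); // 40000
    }
}
```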
[jira] [Updated] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry He updated HBASE-13845: - Attachment: HBASE-13845-branch-1.patch Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch There seems to be a code bug that can cause the expiration of one region server carrying meta to bring down the master in certain cases. Here is the sequence of events. a) The master detects the expiration of a region server on ZK, and starts to expire the region server. b) Since the failed region server carries meta, the shutdown handler will call verifyAndAssignMetaWithRetries() while processing the expired rs. c) In verifyAndAssignMeta(), there is logic to verify the meta region location: {code} if (!server.getMetaTableLocator().verifyMetaRegionLocation(server.getConnection(), this.server.getZooKeeper(), timeout)) { this.services.getAssignmentManager().assignMeta(HRegionInfo.FIRST_META_REGIONINFO); } else if (serverName.equals(server.getMetaTableLocator().getMetaRegionLocation(this.server.getZooKeeper()))) { throw new IOException("hbase:meta is onlined on the dead server " + serverName); } {code} If we see the meta region is still alive on the expired rs, we throw an exception. We do some retries (default 10x1000ms) for verifyAndAssignMeta. If we still get the exception after the retries, we abort the master. 
{code} 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: Master server abort: loaded coprocessors are: [] 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: verifyAndAssignMeta failed after10 times retries, aborting java.io.IOException: hbase:meta is onlined on the dead server bdvs1164.svl.ibm.com,16020,1432681743203 at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta(MetaServerShutdownHandler.java:162) at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries(MetaServerShutdownHandler.java:184) at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:93) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-05-27 06:58:30,156 INFO [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] regionserver.HRegionServer: STOPPED: verifyAndAssignMeta failed after10 times retries, aborting {code} The problem happens when the expired rs is slow processing its own expiration or has a slow death, and is still able to respond to the master's meta verification in the meantime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
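The retry-then-abort flow described above (default 10 attempts, 1000 ms apart, then a master abort) follows a generic pattern that can be sketched as below. The helper is an illustration with invented names, not the actual MetaServerShutdownHandler code:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryThenAbortDemo {

    // Retry a verification task up to maxRetries times, sleeping between
    // attempts; rethrow the last failure so the caller can escalate
    // (in the HBase case, by aborting the master).
    static <T> T withRetries(Callable<T> task, int maxRetries, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                Thread.sleep(sleepMs);
            }
        }
        throw last; // caller escalates, e.g. aborts the master
    }

    public static void main(String[] args) throws Exception {
        // A task that fails twice, then succeeds: the retries absorb the
        // transient failures and no escalation is needed.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("hbase:meta still on dead server");
            }
            return "meta verified";
        }, 10, 1L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```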
[jira] [Commented] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579239#comment-14579239 ] Anoop Sam John commented on HBASE-13871: Yes, this class need not be top level. Let me see how we can make it inner. bq. line 5252: this.isScan = scan.isGetScan() ? -1 : 0; was changed to this.isScan = scan.isGetScan() ? 1 : 0; Yes, this change is intended. Please see that the compare in isStopRow is changed, so we need this change. Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch, HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579253#comment-14579253 ] Nick Dimiduk commented on HBASE-13845: -- Nice one [~jerryhe] Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch There seems to be a code bug that can cause the expiration of one region server carrying meta to bring down the master in certain cases. Here is the sequence of events. a) The master detects the expiration of a region server on ZK, and starts to expire the region server. b) Since the failed region server carries meta, the shutdown handler will call verifyAndAssignMetaWithRetries() while processing the expired rs. c) In verifyAndAssignMeta(), there is logic to verify the meta region location: {code} if (!server.getMetaTableLocator().verifyMetaRegionLocation(server.getConnection(), this.server.getZooKeeper(), timeout)) { this.services.getAssignmentManager().assignMeta(HRegionInfo.FIRST_META_REGIONINFO); } else if (serverName.equals(server.getMetaTableLocator().getMetaRegionLocation(this.server.getZooKeeper()))) { throw new IOException("hbase:meta is onlined on the dead server " + serverName); } {code} If we see the meta region is still alive on the expired rs, we throw an exception. We do some retries (default 10x1000ms) for verifyAndAssignMeta. If we still get the exception after the retries, we abort the master. 
{code} 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: Master server abort: loaded coprocessors are: [] 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: verifyAndAssignMeta failed after10 times retries, aborting java.io.IOException: hbase:meta is onlined on the dead server bdvs1164.svl.ibm.com,16020,1432681743203 at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta(MetaServerShutdownHandler.java:162) at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries(MetaServerShutdownHandler.java:184) at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:93) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-05-27 06:58:30,156 INFO [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] regionserver.HRegionServer: STOPPED: verifyAndAssignMeta failed after10 times retries, aborting {code} The problem happens when the expired rs is slow processing its own expiration or has a slow death, and is still able to respond to the master's meta verification in the meantime.
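The retry-then-abort shape described above (verify, sleep, retry a fixed number of times, abort on exhaustion) can be sketched generically. The names `withRetries` and `Check` here are illustrative, not the actual MetaServerShutdownHandler API:

```java
public class RetryThenAbort {
    interface Check {
        boolean verify() throws Exception;
    }

    // Run the check up to `retries` times, sleeping between attempts.
    // Returns true on the first success; false means the caller should
    // give up (as the master aborts after verifyAndAssignMeta retries).
    static boolean withRetries(Check check, int retries, long sleepMs) {
        for (int attempt = 1; attempt <= retries; attempt++) {
            try {
                if (check.verify()) {
                    return true;
                }
            } catch (Exception e) {
                // log and fall through to the next attempt
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Succeeds on the third attempt; with 10 retries this returns true.
        boolean ok = withRetries(() -> ++calls[0] >= 3, 10, 1L);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```

Note the failure mode the bug report describes: if the check keeps throwing for all attempts (the dead server still answers the verification), the method returns false and the caller aborts.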
[jira] [Commented] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region
[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579196#comment-14579196 ] Ruben Aguiar commented on HBASE-13329: -- Done Memstore flush fails if data has always the same value, breaking the region --- Key: HBASE-13329 URL: https://issues.apache.org/jira/browse/HBASE-13329 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.0.1 Environment: linux-debian-jessie ec2 - t2.micro instances Reporter: Ruben Aguiar Assignee: Ruben Aguiar Priority: Critical Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.1 Attachments: 13329-v1.patch While trying to benchmark my opentsdb cluster, I created a script that always sends the same value (in this case 1) to hbase. After a few minutes, the whole region server crashes and the region itself becomes impossible to open again (it cannot be assigned or unassigned). What I saw in the logs is that when a Memstore flush is called on a large region (128 MB), the flush errors out, killing the regionserver. On restart, replaying the edits generates the same error, making the region unavailable. I tried to manually unassign, assign, or close_region it; that didn't work because the code that reads/replays the edits crashes. From my investigation this seems to be an overflow issue. The logs show that the function getMinimumMidpointArray tried to access index -32743 of an array, extremely close to the minimum short value in Java. Looking at the source code, a short index is used and incremented as long as the two vectors are the same, probably making it overflow on large vectors with equal data. Changing it to int should solve the problem. The hadoop logs from when the regionserver went down follow. Any help is appreciated. 
Any other information you need please do tell me: 2015-03-24 18:00:56,187 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140 2015-03-24 18:00:56,188 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 2015-03-24 18:04:35,722 INFO [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2. 
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743 at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478) at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263) at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932) at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121) at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71) at
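The overflow described above is easy to reproduce outside HBase. A hypothetical prefix-length helper with a `short` index wraps to a negative value once more than Short.MAX_VALUE (32767) leading bytes are equal, producing exactly the kind of negative ArrayIndexOutOfBoundsException seen in the log; widening the index to `int` fixes it. This is a standalone demo, not the actual getMinimumMidpointArray code:

```java
public class ShortIndexOverflow {
    // Buggy shape: the short index wraps to -32768 past 32767 equal bytes,
    // so a[i] throws ArrayIndexOutOfBoundsException with a negative index.
    static int buggyPrefixLength(byte[] a, byte[] b) {
        short i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // Fixed shape: an int index cannot overflow for array-sized inputs.
    static int fixedPrefixLength(byte[] a, byte[] b) {
        int i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] left = new byte[40000];   // all zeros: > 32767 equal bytes
        byte[] right = new byte[40000];
        try {
            buggyPrefixLength(left, right);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("buggy version threw: " + e.getMessage());
        }
        System.out.println("fixed version: " + fixedPrefixLength(left, right));
    }
}
```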
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579219#comment-14579219 ] Jerry He commented on HBASE-13845: -- Thanks for the review, [~stack] Committed to branch-1.1 and branch-1. I am thinking about putting only the test case into the master branch, as a regression guard. What do you think, [~stack]? Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch
[jira] [Commented] (HBASE-13829) Add more ThrottleType
[ https://issues.apache.org/jira/browse/HBASE-13829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579147#comment-14579147 ] Hudson commented on HBASE-13829: FAILURE: Integrated in HBase-TRUNK #6555 (See [https://builds.apache.org/job/HBase-TRUNK/6555/]) HBASE-13829 Add more ThrottleType (Guanghao Zhang) (tedyu: rev 6cc42c8cd16d01cded9936bf53bf35e6e2ff5b66) * hbase-shell/src/main/ruby/shell/commands/set_quota.rb * hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestQuotaThrottle.java * hbase-client/src/main/java/org/apache/hadoop/hbase/quotas/ThrottleSettings.java * hbase-client/src/main/java/org/apache/hadoop/hbase/quotas/ThrottleType.java * src/main/asciidoc/_chapters/ops_mgt.adoc * hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java * hbase-client/src/main/java/org/apache/hadoop/hbase/quotas/QuotaSettingsFactory.java * hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestQuotaAdmin.java * hbase-shell/src/main/ruby/hbase/quotas.rb Add more ThrottleType - Key: HBASE-13829 URL: https://issues.apache.org/jira/browse/HBASE-13829 Project: HBase Issue Type: Improvement Components: Client Reporter: Guanghao Zhang Assignee: Guanghao Zhang Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13829-v1.patch, HBASE-13829-v2.patch, HBASE-13829.patch HBASE-11598 added simple throttling for hbase, but the client does not let the user set a ThrottleType such as WRITE_NUM, WRITE_SIZE, READ_NUM, or READ_SIZE. REVIEW BOARD: https://reviews.apache.org/r/34989/
[jira] [Commented] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579228#comment-14579228 ] stack commented on HBASE-13871: --- Do we have to have a class named FirstOnRowFakeCell at the top level of our class hierarchy? Is it only used in CellUtil? Could it be an inner class of it? Any reason for this change? 5252 this.isScan = scan.isGetScan() ? -1 : 0; 5252 this.isScan = scan.isGetScan() ? 1 : 0; Otherwise, seems good. Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch, HBASE-13871.patch
[jira] [Updated] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region
[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruben Aguiar updated HBASE-13329: - Attachment: 13329-v1.patch Memstore flush fails if data has always the same value, breaking the region --- Key: HBASE-13329 URL: https://issues.apache.org/jira/browse/HBASE-13329 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.0.1 Environment: linux-debian-jessie ec2 - t2.micro instances Reporter: Ruben Aguiar Assignee: Ruben Aguiar Priority: Critical Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.1 Attachments: 13329-v1.patch
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579236#comment-14579236 ] Hadoop QA commented on HBASE-13845: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12738613/HBASE-13845-branch-1.patch against branch-1 branch at commit 6cc42c8cd16d01cded9936bf53bf35e6e2ff5b66. ATTACHMENT ID: 12738613 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14346//console This message is automatically generated. Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579412#comment-14579412 ] Jerry He commented on HBASE-13845: -- It seems there was a glitch applying the branch-1 patch from Hadoop QA; it applied fine with 'git apply' locally. It was mentioned on the hbase dev mailing list recently, as I recall. Committed the test case to the master branch. Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch, HBASE-13845-master-test-case-only.patch
[jira] [Comment Edited] (HBASE-13378) RegionScannerImpl synchronized for READ_UNCOMMITTED Isolation Levels
[ https://issues.apache.org/jira/browse/HBASE-13378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579309#comment-14579309 ] Andrew Purtell edited comment on HBASE-13378 at 6/9/15 5:53 PM: {quote} bq. Up to this point READ_UNCOMMITTED meant: You might see partially finished rows. bq. Now it means: You might see partially finished rows, and you may not see cells that have existed when the scanner started. This sounds to me like an incompatible change for a patch release {quote} Precisely, why does this sound like an *incompatible* change? was (Author: apurtell): {quote} bq. Up to this point READ_UNCOMMITTED meant: You might see partially finished rows. Now it means: You might see partially finished rows, and you may not see cells that have existed when the scanner started. This sounds to me like an incompatible change for a patch release {quote} Precisely, why does this sound like an *incompatible* change? RegionScannerImpl synchronized for READ_UNCOMMITTED Isolation Levels Key: HBASE-13378 URL: https://issues.apache.org/jira/browse/HBASE-13378 Project: HBase Issue Type: New Feature Reporter: John Leach Assignee: John Leach Priority: Minor Attachments: HBASE-13378.patch, HBASE-13378.txt Original Estimate: 2h Time Spent: 2h Remaining Estimate: 0h This block of code below, coupled with the close method, could be changed so that READ_UNCOMMITTED does not synchronize. {code:java} // synchronize on scannerReadPoints so that nobody calculates // getSmallestReadPoint, before scannerReadPoints is updated. IsolationLevel isolationLevel = scan.getIsolationLevel(); synchronized(scannerReadPoints) { this.readPt = getReadpoint(isolationLevel); scannerReadPoints.put(this, this.readPt); } {code} This hotspots for me under heavy get requests.
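A hedged sketch of the change the issue proposes: only take the scannerReadPoints lock when the isolation level needs a coordinated read point, and hand READ_UNCOMMITTED scanners a read point without contending on the monitor. The class and field names mirror the snippet above but are illustrative, not the actual RegionScannerImpl code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReadPointSketch {
    enum IsolationLevel { READ_COMMITTED, READ_UNCOMMITTED }

    private final Map<Object, Long> scannerReadPoints = new ConcurrentHashMap<>();
    private volatile long memstoreReadPoint = 100L; // stand-in for the MVCC read point

    long openScanner(Object scanner, IsolationLevel isolationLevel) {
        if (isolationLevel == IsolationLevel.READ_UNCOMMITTED) {
            // No snapshot guarantee required, so skip the monitor entirely.
            // (HBase uses Long.MAX_VALUE as the "see everything" read point.)
            long readPt = Long.MAX_VALUE;
            scannerReadPoints.put(scanner, readPt);
            return readPt;
        }
        // READ_COMMITTED still synchronizes so getSmallestReadPoint()
        // cannot run between reading the point and publishing it.
        synchronized (scannerReadPoints) {
            long readPt = memstoreReadPoint;
            scannerReadPoints.put(scanner, readPt);
            return readPt;
        }
    }

    public static void main(String[] args) {
        ReadPointSketch region = new ReadPointSketch();
        System.out.println(region.openScanner(new Object(), IsolationLevel.READ_UNCOMMITTED));
        System.out.println(region.openScanner(new Object(), IsolationLevel.READ_COMMITTED));
    }
}
```

This is the trade-off the quoted discussion is about: the uncontended path avoids the get-request hotspot, at the cost of READ_UNCOMMITTED scanners possibly not seeing a consistent snapshot relative to scanner open time.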
[jira] [Updated] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry He updated HBASE-13845: - Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch, HBASE-13845-master-test-case-only.patch
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579266#comment-14579266 ] stack commented on HBASE-13845: --- bq. I am thinking about putting only the test case into master branch, as a regression guide. What do you think, stack? I think it is a good idea. +1 Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch
[jira] [Updated] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry He updated HBASE-13845: - Attachment: HBASE-13845-master-test-case-only.patch Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch, HBASE-13845-master-test-case-only.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13378) RegionScannerImpl synchronized for READ_UNCOMMITTED Isolation Levels
[ https://issues.apache.org/jira/browse/HBASE-13378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579309#comment-14579309 ] Andrew Purtell commented on HBASE-13378: {quote} bq. Up to this point READ_UNCOMMITTED meant: You might see partially finished rows. Now it means: You might see partially finished rows, and you may not see cells that have existed when the scanner started. This sounds to me like an incompatible change for a patch release {quote} Precisely, why does this sound like an *incompatible* change? RegionScannerImpl synchronized for READ_UNCOMMITTED Isolation Levels Key: HBASE-13378 URL: https://issues.apache.org/jira/browse/HBASE-13378 Project: HBase Issue Type: New Feature Reporter: John Leach Assignee: John Leach Priority: Minor Attachments: HBASE-13378.patch, HBASE-13378.txt Original Estimate: 2h Time Spent: 2h Remaining Estimate: 0h The block of code below, coupled with the close method, could be changed so that READ_UNCOMMITTED does not synchronize.
{code:java}
// synchronize on scannerReadPoints so that nobody calculates
// getSmallestReadPoint, before scannerReadPoints is updated.
IsolationLevel isolationLevel = scan.getIsolationLevel();
synchronized(scannerReadPoints) {
  this.readPt = getReadpoint(isolationLevel);
  scannerReadPoints.put(this, this.readPt);
}
{code}
This hotspots for me under heavy get requests.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
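An illustrative sketch of the optimization the issue asks for (this is not HBase's actual patch): only isolation levels that must coordinate with `getSmallestReadPoint` take the `scannerReadPoints` lock, while READ_UNCOMMITTED can use `Long.MAX_VALUE` as its read point without registering. All class and field names besides `scannerReadPoints` and `IsolationLevel` are made up for the demo.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model: scanners at READ_COMMITTED register a read point under the lock;
// READ_UNCOMMITTED scanners see everything, so they skip the hot lock entirely.
public class ReadPointSketch {
    enum IsolationLevel { READ_COMMITTED, READ_UNCOMMITTED }

    final Map<Object, Long> scannerReadPoints = new ConcurrentHashMap<>();
    volatile long memstoreReadPoint = 42L;   // stand-in for the MVCC read point

    long registerScanner(Object scanner, IsolationLevel level) {
        if (level == IsolationLevel.READ_UNCOMMITTED) {
            // No bookkeeping needed: this scanner may see partial rows anyway.
            return Long.MAX_VALUE;
        }
        synchronized (scannerReadPoints) {   // as in the quoted block above
            long readPt = memstoreReadPoint;
            scannerReadPoints.put(scanner, readPt);
            return readPt;
        }
    }

    public static void main(String[] args) {
        ReadPointSketch s = new ReadPointSketch();
        System.out.println(s.registerScanner(new Object(), IsolationLevel.READ_UNCOMMITTED));
        System.out.println(s.registerScanner(new Object(), IsolationLevel.READ_COMMITTED));
    }
}
```

The quoted follow-up comments debate the cost of this: a READ_UNCOMMITTED scanner that never registers also never pins `getSmallestReadPoint`, which is the semantic change Andrew Purtell is questioning.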
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579382#comment-14579382 ] Hudson commented on HBASE-13845: SUCCESS: Integrated in HBase-1.2 #140 (See [https://builds.apache.org/job/HBase-1.2/140/]) HBASE-13845 Expire of one region server carrying meta can bring down the master (jerryjch: rev d37d9c43de6c919a58ff34548a36af1e22e6cc2a) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java * hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMetaShutdownHandler.java Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13873) LoadTestTool addAuthInfoToConf throws UnsupportedOperationException
[ https://issues.apache.org/jira/browse/HBASE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579318#comment-14579318 ] Andrew Purtell commented on HBASE-13873: Would you mind providing a patch [~sunyerui] ? LoadTestTool addAuthInfoToConf throws UnsupportedOperationException --- Key: HBASE-13873 URL: https://issues.apache.org/jira/browse/HBASE-13873 Project: HBase Issue Type: Bug Components: integration tests Affects Versions: 0.98.13 Reporter: sunyerui Fix For: 0.98.14 When running IntegrationTestIngestWithACL on distributed clusters with kerberos security enabled, the method addAuthInfoToConf() in LoadTestTool is invoked and throws UnsupportedOperationException, with the following stack trace:
{code}
2015-06-09 22:15:33,605 ERROR [main] util.AbstractHBaseTool: Error running command-line tool
java.lang.UnsupportedOperationException
	at java.util.AbstractList.add(AbstractList.java:148)
	at java.util.AbstractList.add(AbstractList.java:108)
	at org.apache.hadoop.hbase.util.LoadTestTool.addAuthInfoToConf(LoadTestTool.java:811)
	at org.apache.hadoop.hbase.util.LoadTestTool.loadTable(LoadTestTool.java:516)
	at org.apache.hadoop.hbase.util.LoadTestTool.doWork(LoadTestTool.java:479)
	at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112)
	at org.apache.hadoop.hbase.IntegrationTestIngest.runIngestTest(IntegrationTestIngest.java:151)
	at org.apache.hadoop.hbase.IntegrationTestIngest.internalRunIngestTest(IntegrationTestIngest.java:114)
	at org.apache.hadoop.hbase.IntegrationTestIngest.runTestFromCommandLine(IntegrationTestIngest.java:97)
	at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:115)
	at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.hbase.IntegrationTestIngestWithACL.main(IntegrationTestIngestWithACL.java:136)
{code}
The corresponding code is below and the reason is obvious.
Arrays.asList returns a java.util.Arrays$ArrayList, not a java.util.ArrayList. Both inherit from java.util.AbstractList, but the former does not override the method add(), so the parent method java.util.AbstractList.add() is invoked and the exception is thrown.
{code}
private void addAuthInfoToConf(Properties authConfig, Configuration conf,
    String owner, String userList) throws IOException {
  List<String> users = Arrays.asList(userList.split(","));
  users.add(owner);
  ...
}
{code}
Has anyone else run into this? I think it's an obvious bug but no one has reported it, so please tell me if I am misunderstanding it. If it's actually a bug here, then it can be fixed very easily as below:
{code}
List<String> users = new ArrayList<String>(Arrays.asList(userList.split(",")));
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
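The bug above can be reproduced in isolation: `Arrays.asList` returns a fixed-size view backed by the array, so `add()` falls through to `AbstractList.add` and throws. The class and method names below are for the demo only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal reproduction of the LoadTestTool bug and the proposed fix.
public class FixedSizeListDemo {
    // As in the buggy addAuthInfoToConf: a fixed-size view, add() will throw.
    static List<String> broken(String userList) {
        return Arrays.asList(userList.split(","));
    }

    // The proposed fix: copy into a growable ArrayList first.
    static List<String> fixed(String userList) {
        return new ArrayList<>(Arrays.asList(userList.split(",")));
    }

    public static void main(String[] args) {
        List<String> users = fixed("alice,bob");
        users.add("owner");
        System.out.println(users);              // [alice, bob, owner]
        try {
            broken("alice,bob").add("owner");
        } catch (UnsupportedOperationException e) {
            System.out.println("fixed-size list rejected add()");
        }
    }
}
```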
[jira] [Assigned] (HBASE-13862) TestRegionRebalancing is flaky again
[ https://issues.apache.org/jira/browse/HBASE-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Soldatov reassigned HBASE-13862: --- Assignee: Sergey Soldatov TestRegionRebalancing is flaky again Key: HBASE-13862 URL: https://issues.apache.org/jira/browse/HBASE-13862 Project: HBase Issue Type: Bug Components: test Affects Versions: 2.0.0 Reporter: Mikhail Antonov Assignee: Sergey Soldatov I can reproduce it by running mvn test -Dtest=TestRegionRebalancing on fresh master about 1 out of 3-4 runs.
{code}
Running org.apache.hadoop.hbase.TestRegionRebalancing
2015-06-08 12:00:52.125 java[45610:5873722] Unable to load realm info from SCDynamicStore
Tests run: 2, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 26.743 sec <<< FAILURE! - in org.apache.hadoop.hbase.TestRegionRebalancing
testRebalanceOnRegionServerNumberChange[0](org.apache.hadoop.hbase.TestRegionRebalancing) Time elapsed: 15.599 sec <<< FAILURE!
java.lang.AssertionError: null
	at org.apache.hadoop.hbase.TestRegionRebalancing.testRebalanceOnRegionServerNumberChange(TestRegionRebalancing.java:144)
testRebalanceOnRegionServerNumberChange[1](org.apache.hadoop.hbase.TestRegionRebalancing) Time elapsed: 10.671 sec <<< FAILURE!
java.lang.AssertionError: null
	at org.apache.hadoop.hbase.TestRegionRebalancing.testRebalanceOnRegionServerNumberChange(TestRegionRebalancing.java:144)
Results :
Failed tests:
  TestRegionRebalancing.testRebalanceOnRegionServerNumberChange:144 null
  TestRegionRebalancing.testRebalanceOnRegionServerNumberChange:144 null
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13871) Change RegionScannerImpl to deal with Cell instead of byte[], int, int
[ https://issues.apache.org/jira/browse/HBASE-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579279#comment-14579279 ] stack commented on HBASE-13871: --- bq. Pls see the compare in isStopRow is changed. So we need this change I see. Thanks. Change RegionScannerImpl to deal with Cell instead of byte[], int, int -- Key: HBASE-13871 URL: https://issues.apache.org/jira/browse/HBASE-13871 Project: HBase Issue Type: Sub-task Components: regionserver, Scanners Reporter: Anoop Sam John Assignee: Anoop Sam John Fix For: 2.0.0 Attachments: HBASE-13871.patch, HBASE-13871.patch This is also a sub item for splitting HBASE-13387 into smaller chunks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579351#comment-14579351 ] Esteban Gutierrez commented on HBASE-13874: --- Right, HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD as a percentage doesn't make sense. 20% of the heap could be anywhere from 4GB to 20GB with large heap sizes. The easiest way is to make it configurable, but then we are just adding another knob without being able to track overall resource consumption. Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Fix this in HBaseConfiguration:
{code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int)(globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
      conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
          HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int)(blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int)(CONVERT_TO_PERCENTAGE *
          HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
        "Current heap configuration for MemStore and BlockCache exceeds " +
        "the threshold required for successful cluster operation. " +
        "The combined value cannot exceed 0.8. Please check " +
        "the settings for hbase.regionserver.global.memstore.upperLimit and " +
        "hfile.block.cache.size in your configuration. " +
        "hbase.regionserver.global.memstore.upperLimit is " +
        globalMemstoreLimit +
        " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code}
Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579358#comment-14579358 ] Vladimir Rodionov commented on HBASE-13874: --- Should log WARN if HBase's own heap is below, say, 4GB. No errors should be thrown in any case. If the heap is lower than the minimum (4GB?), then thresholds need to be adjusted and a WARN message should be logged. Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
stack created HBASE-13874: - Summary: Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Fix this in HBaseConfiguration (see checkForClusterFreeMemoryLimit, quoted above). Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12465) HBase master start fails due to incorrect file creations
[ https://issues.apache.org/jira/browse/HBASE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579354#comment-14579354 ] Alicia Ying Shu commented on HBASE-12465: - [~clayb] Which version of HBase are you using? Is it a secure or insecure cluster? Thanks. HBase master start fails due to incorrect file creations Key: HBASE-12465 URL: https://issues.apache.org/jira/browse/HBASE-12465 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.96.0 Environment: Ubuntu Reporter: Biju Nair Assignee: Alicia Ying Shu Labels: hbase, hbase-bulkload - Start of the HBase master fails due to the following error found in the log:
2014-11-11 20:25:58,860 WARN org.apache.hadoop.hbase.backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs:///hbase/.tmp/data/default/tbl/00820520f5cb7839395e83f40c8d97c2/e/52bf9eee7a27460c8d9e2a26fa43c918_SeqId_282271246_ on try #1
org.apache.hadoop.security.AccessControlException: Permission denied: user=hbase, access=WRITE, inode=/hbase/.tmp/data/default/tbl/00820520f5cb7839395e83f40c8d97c2/e/52bf9eee7a27460c8d9e2a26fa43c918_SeqId_282271246_:devuser:supergroup:-rwxr-xr-x
- All the files that the HBase master was complaining about were created under a user's user-id instead of the hbase user, resulting in incorrect access permissions for the master to act on them. - Looks like this was due to a bulk load done using the LoadIncrementalHFiles program. - HBASE-12052 is another scenario similar to this one.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579418#comment-14579418 ] Nick Dimiduk commented on HBASE-13874: -- bq. Should log WARN if HBase own heap is below, say 4GB Yes, I like this approach. What's our decided lower bound? 4g? 2g? Folks have been running with 8-12g total heap for ages; {{8 * 0.2 = 1.6}}. Let's warn if own heap is less than 1.5g or less than 20%, whichever is smaller. [~vrodionov] you think 1.5g is too low? Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack
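The warn rule proposed in the comment above, "warn if own heap is less than 1.5g or less than 20%, whichever is smaller", can be sketched as follows. The class and method names are illustrative, not committed HBase code; only the 0.4 defaults and the 1.5 GB / 20% thresholds come from the thread.

```java
// Hedged sketch of a heap-reservation warn check per the discussion above.
public class FreeHeapWarnRule {
    static final long GB = 1024L * 1024 * 1024;

    // "Own heap" = what is left after memstore + block cache fractions.
    static boolean shouldWarn(long totalHeapBytes, float memstoreFrac, float blockCacheFrac) {
        long ownHeap = (long) (totalHeapBytes * (1.0 - memstoreFrac - blockCacheFrac));
        // min(1.5 GB, 20% of heap): on big heaps the absolute floor dominates,
        // so a 100 GB heap is not forced to reserve 20 GB.
        long floor = Math.min((long) (1.5 * GB), (long) (0.2 * totalHeapBytes));
        return ownHeap < floor;
    }

    public static void main(String[] args) {
        // Default 0.4 + 0.4 leaves ~20% (~1.6 GB of an 8 GB heap): no warning.
        System.out.println(shouldWarn(8 * GB, 0.4f, 0.4f));
        // 0.5 + 0.4 leaves only ~0.8 GB of an 8 GB heap: warn.
        System.out.println(shouldWarn(8 * GB, 0.5f, 0.4f));
    }
}
```

This reproduces the arithmetic in the comment: 8 GB × 0.2 = 1.6 GB, just above the 1.5 GB floor, so today's common 8-12 GB heaps would not warn under the defaults.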
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13846) Run MiniCluster on top of other MiniDfsCluster
[ https://issues.apache.org/jira/browse/HBASE-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579492#comment-14579492 ] Jesse Yates commented on HBASE-13846: - Thanks for taking a look, Ted! I assume otherwise you are +1? Posting another patch momentarily; if QA is happy, I'll commit unless I hear otherwise. Run MiniCluster on top of other MiniDfsCluster -- Key: HBASE-13846 URL: https://issues.apache.org/jira/browse/HBASE-13846 Project: HBase Issue Type: Improvement Components: test Affects Versions: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Reporter: Jesse Yates Assignee: Jesse Yates Priority: Minor Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Attachments: hbase-13846-0.98-v0.patch, hbase-13846-master-v0.patch Similar to how we don't start a mini-zk cluster when we already have one specified, this will skip starting a mini-dfs cluster if the user specifies a different one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13875) Clock skew between master and region server may render restored region without server address
Ted Yu created HBASE-13875: -- Summary: Clock skew between master and region server may render restored region without server address Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to 
hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: regionserver:51852-0x24dd2833c34000b, quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 
48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. 2015-06-08 14:28:41,920 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] hbase.MetaTableAccessor: Updated row table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. with server=ip-172-31-46-239.ec2.internal,51852,1433758173941 2015-06-08 14:28:41,920 DEBUG [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Finished post open deploy task for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a {code} What happened was that due to clock skew, server location (ip-172-31-46-239.ec2.internal) for the region was eclipsed by the delete marker put in by
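The masking effect behind this bug can be modeled simply: in HBase, a delete marker at timestamp T hides any put at timestamp <= T. With the master's clock ahead of the region server's, the master's Delete of the meta row carries a later timestamp than the region server's subsequent Put of the server location, so the location update is invisible. This is a toy model, not HBase code; the timestamps below are shortened stand-ins for the ones in the logs, and the final put illustrates the {{now+1}}-style fix discussed in the comments.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of HBase cell visibility under a delete marker.
public class TimestampMasking {
    long deleteTs = Long.MIN_VALUE;              // newest delete marker on the cell
    final TreeMap<Long, String> puts = new TreeMap<>();

    void delete(long ts) { deleteTs = Math.max(deleteTs, ts); }
    void put(long ts, String value) { puts.put(ts, value); }

    // Visible value = newest put strictly newer than the delete marker.
    String read() {
        Map.Entry<Long, String> e = puts.lastEntry();
        return (e != null && e.getKey() > deleteTs) ? e.getValue() : null;
    }

    public static void main(String[] args) {
        TimestampMasking serverLocation = new TimestampMasking();
        serverLocation.delete(142842014L);       // master's Delete (clock ahead)
        serverLocation.put(142841920L, "ip-172-31-46-239,51852"); // RS's Put (clock behind)
        System.out.println(serverLocation.read());   // null: no server address in meta
        // The fix: write the new location with a timestamp past the delete marker.
        serverLocation.put(serverLocation.deleteTs + 1, "ip-172-31-46-239,51852");
        System.out.println(serverLocation.read());   // location visible again
    }
}
```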
[jira] [Commented] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579585#comment-14579585 ] Enis Soztutar commented on HBASE-13875: --- Otherwise looks good. +1 if tests pass. Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: 13875-branch-1.txt
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579584#comment-14579584 ] Enis Soztutar commented on HBASE-13875: --- You can remove the first line, and change the second one to be {{now+1}}. We do not need to add 20. {code} +// Threads.sleep(20); +addRegionsToMeta(connection, regionInfos, regionReplication, now+20); {code} Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: 13875-branch-1.txt We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. 
containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] 
snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: regionserver:51852-0x24dd2833c34000b, quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.
[jira] [Commented] (HBASE-13846) Run MiniCluster on top of other MiniDfsCluster
[ https://issues.apache.org/jira/browse/HBASE-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579459#comment-14579459 ] Ted Yu commented on HBASE-13846: {code} 2985 * @param requireDown require the that cluster no be up before it is set. {code} I think I know what the above means: {code} 2985 * @param requireDown require that the cluster not be up before it is set. {code} Please add Apache license header to test class. Run MiniCluster on top of other MiniDfsCluster -- Key: HBASE-13846 URL: https://issues.apache.org/jira/browse/HBASE-13846 Project: HBase Issue Type: Improvement Components: test Affects Versions: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Reporter: Jesse Yates Assignee: Jesse Yates Priority: Minor Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Attachments: hbase-13846-0.98-v0.patch, hbase-13846-master-v0.patch Similar to how we don't start a mini-zk cluster when we already have one specified, this will skip starting a mini-dfs cluster if the user specifies a different one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13846) Run MiniCluster on top of other MiniDfsCluster
[ https://issues.apache.org/jira/browse/HBASE-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesse Yates updated HBASE-13846: Attachment: hbase-13846-master-v1.patch Updated master version - fixing javadoc, extra imports, header. Run MiniCluster on top of other MiniDfsCluster -- Key: HBASE-13846 URL: https://issues.apache.org/jira/browse/HBASE-13846 Project: HBase Issue Type: Improvement Components: test Affects Versions: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Reporter: Jesse Yates Assignee: Jesse Yates Priority: Minor Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Attachments: hbase-13846-0.98-v0.patch, hbase-13846-master-v0.patch, hbase-13846-master-v1.patch Similar to how we don't start a mini-zk cluster when we already have one specified, this will skip starting a mini-dfs cluster if the user specifies a different one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13846) Run MiniCluster on top of other MiniDfsCluster
[ https://issues.apache.org/jira/browse/HBASE-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesse Yates updated HBASE-13846: Status: Patch Available (was: Open) Run MiniCluster on top of other MiniDfsCluster -- Key: HBASE-13846 URL: https://issues.apache.org/jira/browse/HBASE-13846 Project: HBase Issue Type: Improvement Components: test Affects Versions: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Reporter: Jesse Yates Assignee: Jesse Yates Priority: Minor Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Attachments: hbase-13846-0.98-v0.patch, hbase-13846-master-v0.patch, hbase-13846-master-v1.patch Similar to how we don't start a mini-zk cluster when we already have one specified, this will skip starting a mini-dfs cluster if the user specifies a different one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13846) Run MiniCluster on top of other MiniDfsCluster
[ https://issues.apache.org/jira/browse/HBASE-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesse Yates updated HBASE-13846: Status: Open (was: Patch Available) Run MiniCluster on top of other MiniDfsCluster -- Key: HBASE-13846 URL: https://issues.apache.org/jira/browse/HBASE-13846 Project: HBase Issue Type: Improvement Components: test Affects Versions: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Reporter: Jesse Yates Assignee: Jesse Yates Priority: Minor Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.1 Attachments: hbase-13846-0.98-v0.patch, hbase-13846-master-v0.patch, hbase-13846-master-v1.patch Similar to how we don't start a mini-zk cluster when we already have one specified, this will skip starting a mini-dfs cluster if the user specifies a different one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to improve its performance. Check the current memstore size and the current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and likewise for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare. Otherwise, based on the percent change in the number of cache misses and the number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I am doing this analysis by comparing the percent change (which is basically a normalized derivative) of the number of evictions and the number of flushes during the last two periods. The main motive for doing this is that if we have random reads, then even after increasing the block cache we won't be able to decrease the number of cache misses, and this way we will not waste memory on the block cache. was: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to improve its performance. Check the current memstore size and the current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and likewise for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare. Otherwise, based on the percent change in the number of cache misses and the number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I am doing this analysis by comparing the percent change (which is basically a normalized derivative) of the number of evictions and the number of flushes in the last two periods. The main motive for doing this is that if we have random reads, then even after increasing the block cache we won't be able to decrease the number of cache misses, and this way we will not waste memory on the block cache. Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to improve its performance. Check the current memstore size and the current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and likewise for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare. Otherwise, based on the percent change in the number of cache misses and the number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I am doing this analysis by comparing the percent change (which is basically a normalized derivative) of the number of evictions and the number of flushes during the last two periods. The main motive for doing this is that if we have random reads, then even after increasing the block cache we won't be able to decrease the number of cache misses, and this way we will not waste memory on the block cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
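The tuning heuristic described above can be sketched as follows. This is a hedged illustration, not the DefaultHeapMemoryTuner code: the class, method names, and the exact decision shape are assumptions; only the 50%-usage "sufficient" check and the percent-change (normalized-derivative) comparison come from the description.

```java
public class TunerSketch {
    /** Percent change between two consecutive period counts; a normalized derivative. */
    static double percentChange(long previous, long current) {
        if (previous == 0) {
            return current == 0 ? 0.0 : 1.0;
        }
        return (current - previous) / (double) previous;
    }

    /** +1 = grow the block cache, -1 = grow the memstore, 0 = leave sizes alone. */
    static int decide(double cacheUsedFraction, double memstoreUsedFraction,
                      double missChange, double flushChange) {
        // Using under half of what is already granted means that side is "sufficient".
        boolean cacheSufficient = cacheUsedFraction < 0.5;
        boolean memstoreSufficient = memstoreUsedFraction < 0.5;
        if (cacheSufficient && memstoreSufficient) return 0;
        if (cacheSufficient) return -1;   // write heavy: give memory to the memstore
        if (memstoreSufficient) return 1; // read heavy: give memory to the block cache
        // Both sides under pressure: shift toward whichever signal is degrading faster,
        // comparing the percent change of misses vs. flushes over the last periods.
        return missChange > flushChange ? 1 : -1;
    }
}
```

For random-read workloads the miss percent change stays high even after growing the cache, which is the case the verification step on the next tuner call is meant to catch.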
[jira] [Commented] (HBASE-13671) More classes to add to the invoking repository of org.apache.hadoop.hbase.mapreduce.driver
[ https://issues.apache.org/jira/browse/HBASE-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579430#comment-14579430 ] li xiang commented on HBASE-13671: -- Thanks Jerry and Ted for reviewing ! More classes to add to the invoking repository of org.apache.hadoop.hbase.mapreduce.driver -- Key: HBASE-13671 URL: https://issues.apache.org/jira/browse/HBASE-13671 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: li xiang Assignee: li xiang Fix For: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Attachments: HBASE-13671-v1.patch In org.apache.hadoop.hbase.mapreduce.driver, only the following classes are added and can be invoked by hbase-server.jar: - RowCounter - CellCounter - Export - Import - ImportTsv - LoadIncrementalHFiles - CopyTable - VerifyReplication More classes(valid programs) to add, such as ExportSnapshot, WALPlayer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13876) Improving performance of HeapMemoryTunerManager
Abhilash created HBASE-13876: Summary: Improving performance of HeapMemoryTunerManager Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 1.1.0, 1.0.1, 2.0.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579432#comment-14579432 ] Hudson commented on HBASE-13845: FAILURE: Integrated in HBase-1.1 #532 (See [https://builds.apache.org/job/HBase-1.1/532/]) HBASE-13845 Expire of one region server carrying meta can bring down the master (jerryjch: rev f95430a8428a5a3e9bfd7b9d38d434eb298133ae) * hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java * hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMetaShutdownHandler.java Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch, HBASE-13845-master-test-case-only.patch There seems to be a code bug that can cause expiration of one region server carrying meta to bring down the master under certain case. Here is the sequence of event. a) The master detects the expiration of a region server on ZK, and starts to expire the region server. b) Since the failed region server carries meta, the shutdown handler will call verifyAndAssignMetaWithRetries() during processing the expired rs. 
c) In verifyAndAssignMeta(), there is logic to verifyMetaRegionLocation:
{code}
if (!server.getMetaTableLocator().verifyMetaRegionLocation(server.getConnection(),
    this.server.getZooKeeper(), timeout)) {
  this.services.getAssignmentManager().assignMeta(HRegionInfo.FIRST_META_REGIONINFO);
} else if (serverName.equals(server.getMetaTableLocator().getMetaRegionLocation(
    this.server.getZooKeeper()))) {
  throw new IOException("hbase:meta is onlined on the dead server " + serverName);
}
{code}
If we see the meta region is still alive on the expired rs, we throw an exception. We do some retries (default 10x1000ms) for verifyAndAssignMeta. If we still get the exception after the retries, we abort the master.
{code}
2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: Master server abort: loaded coprocessors are: []
2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: verifyAndAssignMeta failed after10 times retries, aborting
java.io.IOException: hbase:meta is onlined on the dead server bdvs1164.svl.ibm.com,16020,1432681743203
at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta(MetaServerShutdownHandler.java:162)
at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries(MetaServerShutdownHandler.java:184)
at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:93)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-27 06:58:30,156 INFO [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] regionserver.HRegionServer: STOPPED: verifyAndAssignMeta failed after10 times retries, aborting
{code}
The problem happens when the expired rs is slow processing its own expiration or has a slow death, and is still able to respond to the master's meta verification in the meantime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
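The retry-then-abort flow around verifyAndAssignMetaWithRetries can be sketched generically. This is a minimal illustration of the pattern only; the helper name and shape are assumptions, not the HBase code:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /**
     * Runs a verification task up to maxAttempts times, pausing between tries.
     * Returns true on the first success; false means the caller should give up,
     * which is where the master aborts after its 10 x 1000ms retries.
     */
    static boolean retryOrGiveUp(Callable<Boolean> task, int maxAttempts, long pauseMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (task.call()) {
                    return true;
                }
            } catch (Exception e) {
                // The IOException ("hbase:meta is onlined on the dead server")
                // lands here; swallow it and try again.
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

If the dying region server keeps answering the meta verification, every attempt throws, the loop exhausts its attempts, and the caller aborts, exactly the failure mode in this issue.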
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579484#comment-14579484 ] Esteban Gutierrez commented on HBASE-13874: --- I think we should fail fast in the same way we do now and only make this configurable if required. Finding the right thresholds will always be a problem for users. Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Fix this in HBaseConfiguration:
{code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int) (globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
      conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
          HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int) (blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int) (CONVERT_TO_PERCENTAGE *
          HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
        "Current heap configuration for MemStore and BlockCache exceeds " +
        "the threshold required for successful cluster operation. " +
        "The combined value cannot exceed 0.8. Please check " +
        "the settings for hbase.regionserver.global.memstore.upperLimit and " +
        "hfile.block.cache.size in your configuration. " +
        "hbase.regionserver.global.memstore.upperLimit is " + globalMemstoreLimit +
        " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code}
Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
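To make the arithmetic of the check concrete, here is a standalone sketch of the same integer-percentage comparison (the constant names mirror the snippet; everything else is simplified and illustrative). With the defaults of 0.4 memstore + 0.4 block cache, exactly 20% of the heap stays free, the minimum the check allows; on a 100G heap that reserved slice is 20G no matter how large the heap grows, which is the complaint.

```java
public class HeapLimitCheck {
    static final int CONVERT_TO_PERCENTAGE = 100;
    // Stand-in for HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD: 1 - 0.8.
    static final float MINIMUM_FREE_HEAP = 0.2f;

    /** True when memstore + block cache leave at least 20% of the heap free. */
    static boolean combinedLimitOk(float memstoreUpperLimit, float blockCacheUpperLimit) {
        int gml = (int) (memstoreUpperLimit * CONVERT_TO_PERCENTAGE);
        int bcul = (int) (blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
        return CONVERT_TO_PERCENTAGE - (gml + bcul)
            >= (int) (CONVERT_TO_PERCENTAGE * MINIMUM_FREE_HEAP);
    }
}
```

combinedLimitOk(0.4f, 0.4f) passes at exactly the 0.8 boundary; pushing either fraction higher fails it, and that failure is what the RuntimeException above enforces.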
[jira] [Commented] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region
[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579511#comment-14579511 ] Nick Dimiduk commented on HBASE-13329: -- I'd like this fix for 1.1.1 but would feel better about it if there was a test along with it. Memstore flush fails if data has always the same value, breaking the region --- Key: HBASE-13329 URL: https://issues.apache.org/jira/browse/HBASE-13329 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.0.1 Environment: linux-debian-jessie ec2 - t2.micro instances Reporter: Ruben Aguiar Assignee: Ruben Aguiar Priority: Critical Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.1 Attachments: 13329-v1.patch While trying to benchmark my opentsdb cluster, I've created a script that sends to hbase always the same value (in this case 1). After a few minutes, the whole region server crashes and the region itself becomes impossible to open again (cannot assign or unassign). After some investigation, what I saw on the logs is that when a Memstore flush is called on a large region (128mb) the process errors, killing the regionserver. On restart, replaying the edits generates the same error, making the region unavailable. Tried to manually unassign, assign or close_region. That didn't work because the code that reads/replays it crashes. From my investigation this seems to be an overflow issue. The logs show that the function getMinimumMidpointArray tried to access index -32743 of an array, extremely close to the minimum short value in Java. Upon investigation of the source code, it seems an index short is used, being incremented as long as the two vectors are the same, probably making it overflow on large vectors with equal data. Changing it to int should solve the problem. Here follows the hadoop logs of when the regionserver went down. Any help is appreciated. 
Any other information you need please do tell me: 2015-03-24 18:00:56,187 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140 2015-03-24 18:00:56,188 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 2015-03-24 18:04:35,722 INFO [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2. 
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743 at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478) at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263) at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932) at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121) at
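The short-index overflow the reporter describes can be reproduced in isolation. This is an illustrative stand-in, not the actual getMinimumMidpointArray code: a short loop index scanning a long run of identical bytes wraps past Short.MAX_VALUE. A bounds guard is added here so the sketch returns the wrapped value instead of throwing the ArrayIndexOutOfBoundsException with a negative index seen in the log above.

```java
public class ShortIndexOverflow {
    /** Length of the common prefix of a and b, tracked with a short index. */
    static short commonPrefixLength(byte[] a, byte[] b) {
        short i = 0;
        // Guard on i >= 0 so the wrap is observable; without it, a[i] with a
        // negative index throws ArrayIndexOutOfBoundsException, as in the log.
        while (i >= 0 && i < a.length && i < b.length && a[i] == b[i]) {
            i++; // 32767 + 1 wraps to -32768
        }
        return i;
    }
}
```

On two identical 40,000-byte arrays this returns -32768 instead of 40000; declaring the index as int removes the wrap, which matches the change the reporter proposes.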
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Attachment: HBASE-13876.patch Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-13875: --- Attachment: 13875-branch-1.txt Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Attachments: 13875-branch-1.txt We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: regionserver:51852-0x24dd2833c34000b, 
quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. 2015-06-08 14:28:41,920 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] hbase.MetaTableAccessor: Updated row table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. with server=ip-172-31-46-239.ec2.internal,51852,1433758173941 2015-06-08 14:28:41,920 DEBUG
[jira] [Commented] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579570#comment-14579570 ] Ted Yu commented on HBASE-13875: Patch uses master timestamp for the mutations so that clock skew doesn't affect the correctness of restored table(s). Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Attachments: 13875-branch-1.txt We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. 
containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] 
snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: regionserver:51852-0x24dd2833c34000b, quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. 2015-06-08 14:28:41,920 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] hbase.MetaTableAccessor: Updated row
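The logs above suggest a timestamp-masking problem: the master's meta Deleted/Overwritten mutations carry the master clock (14:28:42,0xx), while the region server's "Updated row ... with server=..." Put carries the region server clock (14:28:41,920), so with enough skew the server-address Put can sort below the master's delete marker and stay invisible. A minimal sketch of the visibility rule under assumed timestamps (illustrative only, not the actual HBase meta code):

```java
public class ClockSkewMasking {
    // HBase-style rule: a put on a cell is visible only if its timestamp
    // is strictly newer than the newest delete marker on that cell.
    static boolean putVisible(long putTs, long deleteTs) {
        return putTs > deleteTs;
    }

    public static void main(String[] args) {
        long masterClock = 1433773722038L;           // master deletes/overwrites the meta row
        long regionServerClock = masterClock - 5000; // RS clock skewed 5s behind

        // The RS writes the server address with its own (earlier) timestamp,
        // so the master's delete marker masks it: no server address visible.
        System.out.println(putVisible(regionServerClock, masterClock)); // false

        // With the patch's approach (master timestamp used for the restore
        // mutations), ordering no longer depends on the RS clock.
    }
}
```

Using the master's timestamp for all restore mutations makes the ordering deterministic regardless of region server clock skew.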
[jira] [Commented] (HBASE-13845) Expire of one region server carrying meta can bring down the master
[ https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579569#comment-14579569 ] Hudson commented on HBASE-13845: FAILURE: Integrated in HBase-TRUNK #6556 (See [https://builds.apache.org/job/HBase-TRUNK/6556/]) HBASE-13845 Expire of one region server carrying meta can bring down the master: test case (jerryjch: rev 14fe23254a78c52bdaef0da819268c8b405059cb) * hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMetaShutdownHandler.java Expire of one region server carrying meta can bring down the master --- Key: HBASE-13845 URL: https://issues.apache.org/jira/browse/HBASE-13845 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0, 1.1.0, 1.2.0 Reporter: Jerry He Assignee: Jerry He Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: HBASE-13845-branch-1.1.patch, HBASE-13845-branch-1.patch, HBASE-13845-master-test-case-only.patch There seems to be a code bug that can cause expiration of one region server carrying meta to bring down the master under certain cases. Here is the sequence of events. a) The master detects the expiration of a region server on ZK, and starts to expire the region server. b) Since the failed region server carries meta, the shutdown handler will call verifyAndAssignMetaWithRetries() while processing the expired rs. c) In verifyAndAssignMeta(), there is logic to verify the meta region location: {code}
if (!server.getMetaTableLocator().verifyMetaRegionLocation(server.getConnection(),
    this.server.getZooKeeper(), timeout)) {
  this.services.getAssignmentManager().assignMeta(HRegionInfo.FIRST_META_REGIONINFO);
} else if (serverName.equals(server.getMetaTableLocator().getMetaRegionLocation(
    this.server.getZooKeeper()))) {
  throw new IOException("hbase:meta is onlined on the dead server " + serverName);
}
{code} If we see the meta region is still alive on the expired rs, we throw an exception. We do some retries (default 10x1000ms) for verifyAndAssignMeta.
If we still get the exception after retries, we abort the master. {code} 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: Master server abort: loaded coprocessors are: [] 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] master.HMaster: verifyAndAssignMeta failed after10 times retries, aborting java.io.IOException: hbase:meta is onlined on the dead server bdvs1164.svl.ibm.com,16020,1432681743203 at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta(MetaServerShutdownHandler.java:162) at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries(MetaServerShutdownHandler.java:184) at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:93) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-05-27 06:58:30,156 INFO [MASTER_META_SERVER_OPERATIONS-bdvs1163:6-0] regionserver.HRegionServer: STOPPED: verifyAndAssignMeta failed after10 times retries, aborting {code} The problem happens when the expired region server is slow processing its own expiration or has a slow death, and is still able to respond to the master's meta verification in the meantime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
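The verifyAndAssignMetaWithRetries() flow described above boils down to a retry wrapper that rethrows, and thereby aborts the master, once all attempts fail. An illustrative sketch with hypothetical names (not the actual MetaServerShutdownHandler code):

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryThenAbort {
    // Run the task up to maxRetries times; if every attempt throws (e.g.
    // meta still appears online on the dead server because it is dying
    // slowly but still answering RPCs), rethrow the last failure so the
    // caller aborts.
    static <T> T withRetries(Callable<T> task, int maxRetries, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < maxRetries; i++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                Thread.sleep(sleepMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        try {
            // A dying-but-responsive region server keeps failing the check.
            withRetries(() -> {
                throw new IOException("hbase:meta is onlined on the dead server");
            }, 10, 1L);
        } catch (Exception e) {
            System.out.println("aborting: " + e.getMessage());
        }
    }
}
```

With the default 10 retries of 1000 ms each, a region server that stays responsive for just over ten seconds past its ZK expiration is enough to exhaust the retries and take the master down.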
[jira] [Updated] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-13875: --- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. Check current memstore size and current block cache size. If we are using less than 50% of currently available block cache size we say block cache is sufficient and same for memstore. This check will be very effective when cluster is load heavy or write heavy. Earlier version just waited to for number of evictions / number of flushes to be zero which is very rare to happen. Otherwise based on percent change in number of cache misses and number of flushes we increase / decrease memory provided for caching / memstore. After doing so, on next call of HeapMemoryTuner we verify that last change has indeed decreased number of evictions / flush ( combined). I am doing this analysis by comparing percent change (which is basically nothing but normalized of the derivative) of number of evictions and number of flushes in last two periods. The main motive for doing this was that if we have random reads then even after increasing block cache we wont be able to decrease number of cache misses and eventually we will not waste memory on block caches. was:I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. 
Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
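The heuristic in the description (compare the percent change, i.e. a normalized derivative, of cache misses and flushes over the last two periods, and shift memory toward the side under more pressure) can be sketched as follows. Names, thresholds, and the exact decision rule are illustrative, not the actual DefaultHeapMemoryTuner code:

```java
public class TunerSketch {
    enum Action { NO_OP, GROW_BLOCK_CACHE, GROW_MEMSTORE }

    // Percent change between two consecutive periods: a normalized
    // derivative of the raw counter.
    static double percentChange(long prev, long curr) {
        if (prev == 0) return curr == 0 ? 0.0 : 1.0;
        return (double) (curr - prev) / prev;
    }

    // Illustrative decision: give memory to whichever side's pressure
    // (cache misses for the block cache, flushes for the memstore) grew
    // faster over the last two periods. With random reads, growing the
    // block cache does not reduce the miss trend, so subsequent periods
    // stop awarding it more memory.
    static Action tune(long prevMisses, long currMisses,
                       long prevFlushes, long currFlushes) {
        double missTrend = percentChange(prevMisses, currMisses);
        double flushTrend = percentChange(prevFlushes, currFlushes);
        if (missTrend == flushTrend) return Action.NO_OP;
        return missTrend > flushTrend ? Action.GROW_BLOCK_CACHE
                                      : Action.GROW_MEMSTORE;
    }

    public static void main(String[] args) {
        // Cache misses doubled while flushes stayed flat: grow the cache.
        System.out.println(tune(100, 200, 50, 50)); // GROW_BLOCK_CACHE
    }
}
```

The point of comparing trends rather than absolute counts is exactly the feedback loop the description outlines: after a change, the next period verifies that the change actually reduced combined evictions and flushes before pushing further in the same direction.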
[jira] [Commented] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region
[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579434#comment-14579434 ] Hadoop QA commented on HBASE-13329: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12738609/13329-v1.patch against master branch at commit 6cc42c8cd16d01cded9936bf53bf35e6e2ff5b66. ATTACHMENT ID: 12738609 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0) {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14345//testReport/ Release Findbugs (version 2.0.3)warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14345//artifact/patchprocess/newFindbugsWarnings.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14345//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14345//console This message is automatically generated. Memstore flush fails if data has always the same value, breaking the region --- Key: HBASE-13329 URL: https://issues.apache.org/jira/browse/HBASE-13329 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.0.1 Environment: linux-debian-jessie ec2 - t2.micro instances Reporter: Ruben Aguiar Assignee: Ruben Aguiar Priority: Critical Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.1 Attachments: 13329-v1.patch While trying to benchmark my opentsdb cluster, I've created a script that sends to hbase always the same value (in this case 1). After a few minutes, the whole region server crashes and the region itself becomes impossible to open again (cannot assign or unassign). After some investigation, what I saw on the logs is that when a Memstore flush is called on a large region (128mb) the process errors, killing the regionserver. On restart, replaying the edits generates the same error, making the region unavailable. Tried to manually unassign, assign or close_region. That didn't work because the code that reads/replays it crashes. From my investigation this seems to be an overflow issue. The logs show that the function getMinimumMidpointArray tried to access index -32743 of an array, extremely close to the minimum short value in Java. Upon investigation of the source code, it seems an index short is used, being incremented as long as the two vectors are the same, probably making it overflow on large vectors with equal data. Changing it to int should solve the problem. 
Here follows the hadoop logs of when the regionserver went down. Any help is appreciated. Any other information you need please do tell me: 2015-03-24 18:00:56,187 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140 2015-03-24 18:00:56,188 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 2015-03-24
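The suspected overflow is easy to demonstrate in isolation. The sketch below is illustrative (not the actual getMinimumMidpointArray code): a short loop counter scanning two long identical arrays wraps past Short.MAX_VALUE to -32768, which then surfaces as a negative array index like the -32743 reported in the logs; an int counter does not have this problem until far larger sizes.

```java
public class ShortIndexOverflow {
    // Count the length of the common prefix of two arrays using a short
    // counter, mimicking the suspected bug: on identical arrays longer
    // than 32767 bytes, the counter wraps negative instead of reaching
    // the true prefix length.
    static short commonPrefixShort(byte[] a, byte[] b) {
        short i = 0;
        while (i >= 0 && i < a.length && i < b.length && a[i] == b[i]) {
            i++; // wraps to -32768 after 32767
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] x = new byte[40000]; // all zeros, so every byte matches
        byte[] y = new byte[40000];
        short result = commonPrefixShort(x, y);
        // The loop exited because i went negative, not because a
        // difference was found; an int counter would have returned 40000.
        System.out.println(result); // -32768
    }
}
```

This matches the report's scenario: a memstore full of cells with the same value produces long runs of identical bytes, so the midpoint computation walks far enough for the short index to overflow.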
[jira] [Commented] (HBASE-13848) Access InfoServer SSL passwords through Credential Provder API
[ https://issues.apache.org/jira/browse/HBASE-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579451#comment-14579451 ] Ted Yu commented on HBASE-13848: lgtm Access InfoServer SSL passwords through Credential Provder API -- Key: HBASE-13848 URL: https://issues.apache.org/jira/browse/HBASE-13848 Project: HBase Issue Type: Improvement Components: security Reporter: Sean Busbey Assignee: Sean Busbey Attachments: HBASE-13848.1.patch, HBASE-13848.1.patch HBASE-11810 took care of getting our SSL passwords out of the Hadoop Credential Provider API, but we also get several out of clear text configuration for the InfoServer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579500#comment-14579500 ] Dave Latham commented on HBASE-13874: - I would definitely appreciate being able to use a config somehow besides recompiling as long as it is logically consistent, even if not recommended. Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Fix this in HBaseConfiguration: {code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int) (globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
      conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
          HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int) (blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int) (CONVERT_TO_PERCENTAGE *
          HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
        "Current heap configuration for MemStore and BlockCache exceeds " +
        "the threshold required for successful cluster operation. " +
        "The combined value cannot exceed 0.8. Please check " +
        "the settings for hbase.regionserver.global.memstore.upperLimit and " +
        "hfile.block.cache.size in your configuration. " +
        "hbase.regionserver.global.memstore.upperLimit is " + globalMemstoreLimit +
        " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code} Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
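As a worked example of the check's arithmetic (plain Java, not the HBase code itself): with the defaults, 100 - (40 + 40) = 20 percent of the heap stays reserved, exactly the hardcoded minimum. Because the reservation is a fixed fraction, it scales with heap size: 1.6 GB on an 8 GB heap, but a full 20 GB on a 100 GB heap, which is the complaint above.

```java
public class HeapCheckMath {
    static final int CONVERT_TO_PERCENTAGE = 100;
    // Mirrors HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD, the
    // hardcoded 1.0 - 0.8 reservation being discussed.
    static final float MINIMUM_FREE_THRESHOLD = 0.2f;

    // true if the memstore + block cache fractions leave at least the
    // minimum free fraction of the heap, following the same integer
    // percentage arithmetic as checkForClusterFreeMemoryLimit.
    static boolean configOk(float memstoreFrac, float blockCacheFrac) {
        int gml = (int) (memstoreFrac * CONVERT_TO_PERCENTAGE);
        int bcul = (int) (blockCacheFrac * CONVERT_TO_PERCENTAGE);
        return CONVERT_TO_PERCENTAGE - (gml + bcul)
            >= (int) (CONVERT_TO_PERCENTAGE * MINIMUM_FREE_THRESHOLD);
    }

    public static void main(String[] args) {
        System.out.println(configOk(0.4f, 0.4f)); // true: exactly 20% free
        System.out.println(configOk(0.5f, 0.4f)); // false: only 10% free
    }
}
```

Making the 0.2 threshold configurable (as the comment requests) would let operators of very large heaps trade some of that fixed-fraction reserve back to the memstore or block cache.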
[jira] [Updated] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated HBASE-13875: -- Fix Version/s: 1.1.1 1.2.0 2.0.0
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. Check current memstore size and current block cache size. If we are using less than 50% of currently available block cache size we say block cache is sufficient and same for memstore. This check will be very effective when server is either load heavy or write heavy. Earlier version just waited to for number of evictions / number of flushes to be zero which is very rare to happen. Otherwise based on percent change in number of cache misses and number of flushes we increase / decrease memory provided for caching / memstore. After doing so, on next call of HeapMemoryTuner we verify that last change has indeed decreased number of evictions / flush ( combined). I am doing this analysis by comparing percent change (which is basically nothing but normalized derivative) of number of evictions and number of flushes in last two periods. The main motive for doing this was that if we have random reads then even after increasing block cache we wont be able to decrease number of cache misses and eventually we will not waste memory on block caches. was: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. Check current memstore size and current block cache size. If we are using less than 50% of currently available block cache size we say block cache is sufficient and same for memstore. This check will be very effective when server is either load heavy or write heavy. 
Earlier version just waited to for number of evictions / number of flushes to be zero which is very rare to happen. Otherwise based on percent change in number of cache misses and number of flushes we increase / decrease memory provided for caching / memstore. After doing so, on next call of HeapMemoryTuner we verify that last change has indeed decreased number of evictions / flush ( combined). I am doing this analysis by comparing percent change (which is basically nothing but normalized of the derivative) of number of evictions and number of flushes in last two periods. The main motive for doing this was that if we have random reads then even after increasing block cache we wont be able to decrease number of cache misses and eventually we will not waste memory on block caches. Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The current checks under which the DefaultHeapMemoryTuner works are very rare so I am trying to weaken these checks to improve its performance. Check current memstore size and current block cache size. If we are using less than 50% of currently available block cache size we say block cache is sufficient and same for memstore. This check will be very effective when server is either load heavy or write heavy. Earlier version just waited to for number of evictions / number of flushes to be zero which is very rare to happen. Otherwise based on percent change in number of cache misses and number of flushes we increase / decrease memory provided for caching / memstore. 
After doing so, on next call of HeapMemoryTuner we verify that last change has indeed decreased number of evictions / flush ( combined). I am doing this analysis by comparing percent change (which is basically nothing but normalized derivative) of number of evictions and number of flushes in last two periods. The main motive for doing this was that if we have random reads then even after increasing block cache we wont be able to decrease number of cache misses and eventually we will not waste memory on block caches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-13875: --- Attachment: (was: 13875-branch-1.txt) Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 1.2.0, 1.1.1 We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: regionserver:51852-0x24dd2833c34000b, 
quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. 2015-06-08 14:28:41,920 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] hbase.MetaTableAccessor: Updated row table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. with server=ip-172-31-46-239.ec2.internal,51852,1433758173941 2015-06-08 14:28:41,920
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. Check the current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the cluster is load heavy or write heavy. Earlier versions just waited for the number of evictions / flushes to be zero, which rarely happens. Otherwise, based on the percent change in the number of cache misses and flushes, we increase / decrease the memory provided for caching / the memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I do this analysis by comparing the percent change (essentially a normalized derivative) of the number of evictions and flushes over the last two periods. The main motivation is that with random reads, even after increasing the block cache we won't be able to decrease the number of cache misses, so we avoid wasting memory on the block cache. was: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. Check the current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the cluster is load heavy or write heavy. 
Earlier versions just waited for the number of evictions / flushes to be zero, which rarely happens. Otherwise, based on the percent change in the number of cache misses and flushes, we increase / decrease the memory provided for caching / the memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I do this analysis by comparing the percent change (essentially a normalized derivative) of the number of evictions and flushes over the last two periods. The main motivation is that with random reads, even after increasing the block cache we won't be able to decrease the number of cache misses, so we avoid wasting memory on the block cache. Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. Check the current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the cluster is load heavy or write heavy. Earlier versions just waited for the number of evictions / flushes to be zero, which rarely happens. Otherwise, based on the percent change in the number of cache misses and flushes, we increase / decrease the memory provided for caching / the memstore. 
After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I do this analysis by comparing the percent change (essentially a normalized derivative) of the number of evictions and flushes over the last two periods. The main motivation is that with random reads, even after increasing the block cache we won't be able to decrease the number of cache misses, so we avoid wasting memory on the block cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. Check the current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the server is either load heavy or write heavy. Earlier versions just waited for the number of evictions / flushes to be zero, which rarely happens. Otherwise, based on the percent change in the number of cache misses and flushes, we increase / decrease the memory provided for caching / the memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I do this analysis by comparing the percent change (essentially a normalized derivative) of the number of evictions and flushes over the last two periods. The main motivation is that with random reads, even after increasing the block cache we won't be able to decrease the number of cache misses, so we avoid wasting memory on the block cache. was: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. Check the current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the cluster is load heavy or write heavy. 
Earlier versions just waited for the number of evictions / flushes to be zero, which rarely happens. Otherwise, based on the percent change in the number of cache misses and flushes, we increase / decrease the memory provided for caching / the memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I do this analysis by comparing the percent change (essentially a normalized derivative) of the number of evictions and flushes over the last two periods. The main motivation is that with random reads, even after increasing the block cache we won't be able to decrease the number of cache misses, so we avoid wasting memory on the block cache. Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. Check the current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the server is either load heavy or write heavy. Earlier versions just waited for the number of evictions / flushes to be zero, which rarely happens. Otherwise, based on the percent change in the number of cache misses and flushes, we increase / decrease the memory provided for caching / the memstore. 
After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the combined number of evictions / flushes. I do this analysis by comparing the percent change (essentially a normalized derivative) of the number of evictions and flushes over the last two periods. The main motivation is that with random reads, even after increasing the block cache we won't be able to decrease the number of cache misses, so we avoid wasting memory on the block cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579494#comment-14579494 ] stack commented on HBASE-13874: --- As per Esteban, I could just make the 0.8 upper bound configurable, and include how to change the limit in the message we throw when we fail to start. WARNs are not always seen. Guessing an allowed minimum is always going to be unsatisfactory for someone (see Jimmy Lin running HBase on a Raspberry Pi, or this user I've been working with running 110G heaps). Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Fix this in HBaseConfiguration:
{code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int)(globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
    conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
      HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int)(blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int)(CONVERT_TO_PERCENTAGE *
        HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
      "Current heap configuration for MemStore and BlockCache exceeds " +
      "the threshold required for successful cluster operation. " +
      "The combined value cannot exceed 0.8. Please check " +
      "the settings for hbase.regionserver.global.memstore.upperLimit and " +
      "hfile.block.cache.size in your configuration. " +
      "hbase.regionserver.global.memstore.upperLimit is " + globalMemstoreLimit +
      " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code}
Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough). 
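A minimal sketch of what the check might look like with the hardcoded bound made configurable follows. The `minFreeThreshold` parameter is hypothetical, standing in for whatever property the eventual patch introduces; the sketch keeps the original code's conversion to integer percentages, which sidesteps floating point noise (e.g. `0.4f + 0.4f` does not compare cleanly against `0.8`).

```java
// Sketch only: a configurable variant of the memstore + block cache sanity
// check. The minFreeThreshold parameter is hypothetical, not an actual
// HBase configuration property.
public class HeapLimitCheck {
    static final int CONVERT_TO_PERCENTAGE = 100;

    // Returns true when memstore + block cache leave at least
    // minFreeThreshold of the heap free. Integer percentages avoid
    // float rounding surprises.
    public static boolean isAcceptable(float memstoreLimit,
                                       float blockCacheLimit,
                                       float minFreeThreshold) {
        int gml = (int) (memstoreLimit * CONVERT_TO_PERCENTAGE);
        int bcul = (int) (blockCacheLimit * CONVERT_TO_PERCENTAGE);
        int minFree = (int) (minFreeThreshold * CONVERT_TO_PERCENTAGE);
        return CONVERT_TO_PERCENTAGE - (gml + bcul) >= minFree;
    }
}
```

With the default 0.4 + 0.4 split and a 0.2 reserve this passes; on a 110G heap an operator could lower the reserve instead of being stuck at the hardcoded 0.8 sum.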
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Summary: Improving performance of HeapMemoryManager (was: Improving performance of HeapMemoryTunerManager) Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rarely met, so I am trying to relax these checks to improve its effectiveness. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-13875: --- Attachment: 13875-branch-1.txt Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: 13875-branch-1.txt We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: regionserver:51852-0x24dd2833c34000b, 
quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. 2015-06-08 14:28:41,920 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] hbase.MetaTableAccessor: Updated row table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. with
[jira] [Commented] (HBASE-13875) Clock skew between master and region server may render restored region without server address
[ https://issues.apache.org/jira/browse/HBASE-13875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579576#comment-14579576 ] Ted Yu commented on HBASE-13875: Removes redundant Put in MetaTableAccessor#addRegionsToMeta(). Clock skew between master and region server may render restored region without server address - Key: HBASE-13875 URL: https://issues.apache.org/jira/browse/HBASE-13875 Project: HBase Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: 13875-branch-1.txt We observed the following issue in cluster testing on a restored table (table_gwbh9rxyz3). {code} 2015-06-08 14:29:47,313|beaver.component.hbase|INFO|6196|140144585275136|MainThread| 'get 'table_gwbh9rxyz3','row1', {COLUMN = 'family1'}' ... 2015-06-08 14:31:38,203|beaver.machine|INFO|6196|140144585275136|MainThread|ERROR: No server address listed in hbase:meta for region table_gwbh9rxyz3,,1433773371699. 48652273628a291653d8c43aaa02179a. containing row row1 {code} Here was related log snippet from master - part for RestoreSnapshotHandler#handleTableOperation(): {code} 2015-06-08 14:28:41,968 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: starting restore 2015-06-08 14:28:41,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: get table regions: hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3 2015-06-08 14:28:41,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: found 1 regions for table=table_gwbh9rxyz3 2015-06-08 14:28:41,984 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] snapshot.RestoreSnapshotHelper: region to restore: 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:42,001 DEBUG [RestoreSnapshot-pool584-t1] backup.HFileArchiver: Finished archiving from class 
org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66, to hdfs://ip-172-31-46-239.ec2.internal:8020/user/hbase/.slider/cluster/hbasesliderapp/database/archive/data/default/table_gwbh9rxyz3/48652273628a291653d8c43aaa02179a/family1/45aa3fb9e0404814b77a9cac91ebeb66 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [] 2015-06-08 14:28:42,002 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 0 2015-06-08 14:28:42,014 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Deleted [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Verify snapshot=table_gwbh9rxyz3-ru-20150608 against=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 2015-06-08 14:28:42,022 DEBUG [B.defaultRpcServer.handler=13,queue=1,port=54936] snapshot.SnapshotManager: Sentinel is not yet finished with restoring snapshot={ ss=table_gwbh9rxyz3-ru-20150608 table=table_gwbh9rxyz3 type=FLUSH } 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Added 2 2015-06-08 14:28:42,038 INFO [MASTER_TABLE_OPERATIONS-ip-172-31-46-243:54936-0] hbase.MetaTableAccessor: Overwritten [{ENCODED = 48652273628a291653d8c43aaa02179a, NAME = 'table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a.', STARTKEY = '', ENDKEY = ''}] {code} Here was log snippet from region server - corresponding to table being enabled after snapshot restore: {code} 2015-06-08 14:28:41,914 DEBUG [RS_OPEN_REGION-ip-172-31-46-239:51852-2] zookeeper.ZKAssign: 
regionserver:51852-0x24dd2833c34000b, quorum=ip-172-31-46-239.ec2.internal:2181,ip-172-31-46-241.ec2.internal:2181,ip-172-31-46-242.ec2.internal:2181, baseZNode=/services/slider/users/hbase/hbasesliderapp Attempting to retransition opening state of node 48652273628a291653d8c43aaa02179a 2015-06-08 14:28:41,916 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] regionserver.HRegionServer: Post open deploy tasks for table_gwbh9rxyz3,,1433773371699.48652273628a291653d8c43aaa02179a. 2015-06-08 14:28:41,920 INFO [PostOpenDeployTasks:48652273628a291653d8c43aaa02179a] hbase.MetaTableAccessor: Updated row
[jira] [Updated] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Esteban Gutierrez updated HBASE-13874: -- Attachment: 0001-HBASE-13874-Fix-0.8-being-hardcoded-sum-of-blockcach.patch Hey [~stack], let me know if this is what you were initially thinking about. Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Assignee: Esteban Gutierrez Attachments: 0001-HBASE-13874-Fix-0.8-being-hardcoded-sum-of-blockcach.patch Fix this in HBaseConfiguration:
{code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int)(globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
    conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
      HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int)(blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int)(CONVERT_TO_PERCENTAGE *
        HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
      "Current heap configuration for MemStore and BlockCache exceeds " +
      "the threshold required for successful cluster operation. " +
      "The combined value cannot exceed 0.8. Please check " +
      "the settings for hbase.regionserver.global.memstore.upperLimit and " +
      "hfile.block.cache.size in your configuration. " +
      "hbase.regionserver.global.memstore.upperLimit is " + globalMemstoreLimit +
      " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code}
Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579710#comment-14579710 ] Vladimir Rodionov commented on HBASE-13874: --- [~saint@gmail.com] asked: {quote} Vladimir Rodionov you think 1.5g is too low? {quote} Yes, we have observed OOME with heaps below 8GB while running M/R jobs (1.5-2GB reserved for HBase) with standard settings for block cache and memstore in the past (pre-0.98). Can't say for 0.98+ since heaps are larger now :). Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Assignee: Esteban Gutierrez Attachments: 0001-HBASE-13874-Fix-0.8-being-hardcoded-sum-of-blockcach.patch Fix this in HBaseConfiguration:
{code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int)(globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
    conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
      HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int)(blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int)(CONVERT_TO_PERCENTAGE *
        HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
      "Current heap configuration for MemStore and BlockCache exceeds " +
      "the threshold required for successful cluster operation. " +
      "The combined value cannot exceed 0.8. Please check " +
      "the settings for hbase.regionserver.global.memstore.upperLimit and " +
      "hfile.block.cache.size in your configuration. " +
      "hbase.regionserver.global.memstore.upperLimit is " + globalMemstoreLimit +
      " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code}
Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13874) Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap
[ https://issues.apache.org/jira/browse/HBASE-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579714#comment-14579714 ] Vladimir Rodionov commented on HBASE-13874: --- If block cache + memstore exceed (1 - hbase.regionserver.reserved.memory.threshold), we throw a RuntimeException, and we can control the reserved memory ratio. Looks right to me. Fix 0.8 being hardcoded sum of blockcache + memstore; doesn't make sense when big heap -- Key: HBASE-13874 URL: https://issues.apache.org/jira/browse/HBASE-13874 Project: HBase Issue Type: Task Reporter: stack Assignee: Esteban Gutierrez Attachments: 0001-HBASE-13874-Fix-0.8-being-hardcoded-sum-of-blockcach.patch Fix this in HBaseConfiguration:
{code}
private static void checkForClusterFreeMemoryLimit(Configuration conf) {
  float globalMemstoreLimit = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
  int gml = (int)(globalMemstoreLimit * CONVERT_TO_PERCENTAGE);
  float blockCacheUpperLimit =
    conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
      HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
  int bcul = (int)(blockCacheUpperLimit * CONVERT_TO_PERCENTAGE);
  if (CONVERT_TO_PERCENTAGE - (gml + bcul)
      < (int)(CONVERT_TO_PERCENTAGE *
        HConstants.HBASE_CLUSTER_MINIMUM_MEMORY_THRESHOLD)) {
    throw new RuntimeException(
      "Current heap configuration for MemStore and BlockCache exceeds " +
      "the threshold required for successful cluster operation. " +
      "The combined value cannot exceed 0.8. Please check " +
      "the settings for hbase.regionserver.global.memstore.upperLimit and " +
      "hfile.block.cache.size in your configuration. " +
      "hbase.regionserver.global.memstore.upperLimit is " + globalMemstoreLimit +
      " hfile.block.cache.size is " + blockCacheUpperLimit);
  }
}
{code}
Hardcoding 0.8 doesn't make much sense in a heap of 100G+ (that is 20G over for hbase itself -- more than enough). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated HBASE-13877: -- Status: Patch Available (was: Open) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL Key: HBASE-13877 URL: https://issues.apache.org/jira/browse/HBASE-13877 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: hbase-13877_v1.patch ITBLL with 1.25B rows failed for me (and for Stack, as reported in https://issues.apache.org/jira/browse/HBASE-13811?focusedCommentId=14577834page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14577834). HBASE-13811 and HBASE-13853 fixed an issue with WAL edit filtering, but the root cause this time seems to be different. It is due to the procedure-based flush interrupting the flush request when the procedure is cancelled from an exception elsewhere. This leaves the memstore snapshot intact without aborting the server. The next flush then flushes the previous memstore with the current seqId (as opposed to the seqId from the memstore snapshot). This creates an hfile with a larger seqId than its contents warrant. The previous behavior in 0.98 and 1.0 (I believe) was that an interruption / exception after flush prepare would cause an RS abort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
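The invariant being violated here can be sketched as follows; all names are hypothetical illustrations, not HBase's actual flush code:

```java
// Illustrative sketch of the invariant described above: the seqId stamped on
// a flushed file must come from the memstore snapshot being flushed, not from
// the region's current (newer) seqId. All names here are hypothetical.
public class FlushSeqIdSketch {

    /**
     * Picks the seqId to stamp on the hfile produced by a flush. When an
     * interrupted flush left a snapshot behind, the next flush must reuse the
     * snapshot's seqId; stamping currentSeqId would claim the file contains
     * edits newer than it really holds -- the data-loss scenario above, since
     * WAL replay would then skip edits it believes are already persisted.
     */
    static long chooseFlushSeqId(long snapshotSeqId, long currentSeqId,
                                 boolean snapshotPending) {
        return snapshotPending ? snapshotSeqId : currentSeqId;
    }
}
```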
[jira] [Updated] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated HBASE-13877: -- Attachment: hbase-13877_v1.patch How about this? Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL Key: HBASE-13877 URL: https://issues.apache.org/jira/browse/HBASE-13877 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: hbase-13877_v1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to make it act more often. Check current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare. Otherwise, based on the percent change in the number of cache misses and number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the number of evictions / flushes (combined). I do this analysis by comparing the percent change (which is basically nothing but a normalized derivative) of the number of evictions and number of flushes during the last two periods. The main motive for doing this was that if we have random reads then we will have a lot of cache misses, but even after increasing the block cache we won't be able to decrease the number of cache misses, so we will revert back and eventually not waste memory on block caches. This will also help us ignore short-term random spikes in reads / writes. was: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to make it act more often. Check current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare to happen. Otherwise, based on the percent change in the number of cache misses and number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the number of evictions / flushes (combined). I do this analysis by comparing the percent change (which is basically nothing but a normalized derivative) of the number of evictions and number of flushes during the last two periods. The main motive for doing this was that if we have random reads then even after increasing the block cache we won't be able to decrease the number of cache misses, and eventually we will not waste memory on block caches. Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
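The percent-change (normalized-derivative) comparison described above can be sketched as follows; the class and method names are invented for illustration and this is not the actual DefaultHeapMemoryTuner code:

```java
// Illustrative sketch of the tuning heuristic described above: compare the
// percent change of combined evictions + flushes across the last two periods
// and revert the previous tuning step if it did not help. All names here are
// hypothetical, not HBase's actual tuner implementation.
public class TunerSketch {

    /** Normalized change between two consecutive period counts. */
    static double percentChange(long previous, long current) {
        if (previous == 0) {
            // Treat any growth from zero as a 100% increase.
            return current == 0 ? 0.0 : 1.0;
        }
        return (current - previous) / (double) previous;
    }

    /**
     * True when the last tuning step should be reverted: the combined
     * eviction + flush pressure did not decrease, so the extra memory given
     * to the cache (or memstore) was wasted. Reverting on no-improvement is
     * what lets the tuner ignore short-term random spikes.
     */
    static boolean shouldRevert(long prevEvictions, long currEvictions,
                                long prevFlushes, long currFlushes) {
        double combined = percentChange(prevEvictions + prevFlushes,
                                        currEvictions + currFlushes);
        return combined >= 0.0;
    }
}
```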
[jira] [Updated] (HBASE-13876) Improving performance of HeapMemoryManager
[ https://issues.apache.org/jira/browse/HBASE-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhilash updated HBASE-13876: - Description: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to make it act more often. Check current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare. Otherwise, based on the percent change in the number of cache misses and number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the number of evictions / flushes (combined). I do this analysis by comparing the percent change (which is basically nothing but a normalized derivative) of the number of evictions and number of flushes during the last two periods. The main motive for doing this was that if we have random reads then we will have a lot of cache misses, but even after increasing the block cache we won't be able to decrease the number of cache misses, so we will revert back and eventually not waste memory on block caches. This will also help us ignore random short-term spikes in reads / writes. was: I am trying to improve the performance of DefaultHeapMemoryTuner by introducing some more checks. The conditions under which the current DefaultHeapMemoryTuner acts are very rare, so I am trying to weaken these checks to make it act more often. Check current memstore size and current block cache size: if we are using less than 50% of the currently available block cache size, we say the block cache is sufficient, and the same for the memstore. This check will be very effective when the server is either load heavy or write heavy. The earlier version just waited for the number of evictions / number of flushes to be zero, which is very rare. Otherwise, based on the percent change in the number of cache misses and number of flushes, we increase / decrease the memory provided for caching / memstore. After doing so, on the next call of HeapMemoryTuner we verify that the last change has indeed decreased the number of evictions / flushes (combined). I do this analysis by comparing the percent change (which is basically nothing but a normalized derivative) of the number of evictions and number of flushes during the last two periods. The main motive for doing this was that if we have random reads then we will have a lot of cache misses, but even after increasing the block cache we won't be able to decrease the number of cache misses, so we will revert back and eventually not waste memory on block caches. This will also help us ignore short-term random spikes in reads / writes. Improving performance of HeapMemoryManager -- Key: HBASE-13876 URL: https://issues.apache.org/jira/browse/HBASE-13876 Project: HBase Issue Type: Improvement Components: hbase, regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 1.1.1 Reporter: Abhilash Assignee: Abhilash Priority: Minor Attachments: HBASE-13876.patch
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579723#comment-14579723 ] Enis Soztutar commented on HBASE-13877: --- [~jinghe] mind taking a look? I'll run ITBLL with this in the meantime. Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL Key: HBASE-13877 URL: https://issues.apache.org/jira/browse/HBASE-13877 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: hbase-13877_v1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)