[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118247#comment-13118247 ]

Ted Yu commented on HBASE-2794:
-------------------------------

TestServerCustomProtocol#testRowRange failed during the test suite run but passed standalone.

ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
------------------------------------------------------------------------------------------

Key: HBASE-2794
URL: https://issues.apache.org/jira/browse/HBASE-2794
Project: HBase
Issue Type: Improvement
Components: performance
Reporter: Kannan Muthukkaruppan
Assignee: Mikhail Bautin
Fix For: 0.92.0

Noticed the following snippet in StoreFile.java:Scanner:shouldSeek():

{code}
switch(bloomFilterType) {
  case ROW:
    key = row;
    break;
  case ROWCOL:
    if (columns.size() == 1) {
      byte[] col = columns.first();
      key = Bytes.add(row, col);
      break;
    }
    //$FALL-THROUGH$
  default:
    return true;
}
{code}

If columns.size() > 1, then we currently don't take advantage of the bloom filter. We should optimize this to check the bloom filter for each of the columns and, if none of the columns are present in the bloom filter, avoid opening the file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
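The optimization proposed in the issue description could look roughly like the sketch below. It is only an illustration under stated assumptions, not the HBase code: `KeyFilter`, `ExactFilter`, `rowCol` and the `shouldSeek` signature are hypothetical names, and an exact set stands in for a real Bloom filter so that the demo has no false positives.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a per-column ROWCOL check; not the HBase implementation. */
final class MultiColumnBloomSketch {

    /** Minimal membership interface standing in for a Bloom filter (assumed, not the HBase API). */
    interface KeyFilter {
        boolean mightContain(byte[] key);
    }

    /** Exact-set stand-in: it never reports false positives, which keeps the demo deterministic. */
    static final class ExactFilter implements KeyFilter {
        private final Set<ByteBuffer> keys = new HashSet<>();

        void add(byte[] key) {
            keys.add(ByteBuffer.wrap(key.clone()));
        }

        @Override
        public boolean mightContain(byte[] key) {
            return keys.contains(ByteBuffer.wrap(key));
        }
    }

    /** Concatenates row and column, mirroring Bytes.add(row, col) in the quoted snippet. */
    static byte[] rowCol(byte[] row, byte[] col) {
        byte[] key = Arrays.copyOf(row, row.length + col.length);
        System.arraycopy(col, 0, key, row.length, col.length);
        return key;
    }

    /** Returns false only when the filter rules out every requested row+column key. */
    static boolean shouldSeek(KeyFilter bloom, byte[] row, List<byte[]> columns) {
        if (columns.isEmpty()) {
            return true; // no explicit columns, so a ROWCOL filter cannot be consulted
        }
        for (byte[] col : columns) {
            if (bloom.mightContain(rowCol(row, col))) {
                return true; // at least one column may be present: the file must be read
            }
        }
        return false; // every column was ruled out: the file can be skipped
    }

    /** Small helper for building byte[] keys from text in examples. */
    static byte[] b(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With a real Bloom filter, a `true` result may still be a false positive, so the file is only skipped when every probe comes back negative.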
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117517#comment-13117517 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2161
-----------------------------------------------------------

src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2084/#comment5035

    I was implying that this is also a method argument when I wrote this comment. I will edit this to make it clearer.

src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
https://reviews.apache.org/r/2084/#comment5036

    Yes, I will modify the javadoc of this method.

- Mikhail

On 2011-09-28 16:03:52, Mikhail Bautin wrote:
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/2084/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-09-28 16:03:52)
bq.
bq. Review request for hbase.
bq.
bq. Summary
bq. -------
bq.
bq. Previously we only used row-column Bloom filters for scans that only requested one column. We have seen production queries that request up to 200 columns, and with say ~6 store files per store (region / column family combination) this might have resulted in 1200 block read operations in the worst case. With this diff we will be avoiding seeks on store files that we know don't contain the row/column of interest when using an ExplicitColumnTracker. The performance should remain the same for column range queries.
bq.
bq. This addresses bug HBASE-2794.
bq.     https://issues.apache.org/jira/browse/HBASE-2794
bq.
bq. Diffs
bq. -----
bq.
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 08d3ba4
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java ac2348e
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 4aa72de
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 68cdac5
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java fd9e7ef
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 9d9895c
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java 6cdada7
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98
bq.   src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java PRE-CREATION
bq.   src/main/java/org/apache/hadoop/hbase/KeyValue.java 585c4a8
bq.   src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java f5173c4
bq.   src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java a3d778e
bq.   src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 32f88fb
bq.   src/test/java/org/apache/hadoop/hbase/regionserver/TestKeyValueHeap.java a5d13f7
bq.   src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiColumnScanner.java baee696
bq.   src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java PRE-CREATION
bq.
bq. Diff: https://reviews.apache.org/r/2084/diff
bq.
bq. Testing
bq. -------
bq.
bq. Existing unit tests. A new unit test (TestScanWithBloomError). Load testing using HBaseTest.
bq.
bq. Thanks,
bq.
bq. Mikhail
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117523#comment-13117523 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

bq. On 2011-09-28 17:42:46, Ted Yu wrote:
bq. This is an important feature.
bq.
bq. Since the boolean parameter, forward, correlates so closely with reseek, can we give it a better name ?
bq. I was thinking about either reseek or forwardOnly.

We have a few diffs in the pipeline that depend on this one. Can we rename the boolean flag after we commit those diffs?

- Mikhail

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2137
-----------------------------------------------------------
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117524#comment-13117524 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

bq. On 2011-09-28 17:42:46, Ted Yu wrote:
bq. This is an important feature.
bq.
bq. Since the boolean parameter, forward, correlates so closely with reseek, can we give it a better name ?
bq. I was thinking about either reseek or forwardOnly.
bq.
bq. Mikhail Bautin wrote:
bq. We have a few diffs in the pipeline that depend on this one. Can we rename the boolean flag after we commit those diffs?

I am fine with the current name of forward.

- Ted

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2137
-----------------------------------------------------------
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117615#comment-13117615 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/
-----------------------------------------------------------

(Updated 2011-09-29 21:05:20.334849)

Review request for hbase.

Changes
-------

Addressing Jonathan's comments.

Summary
-------

Previously we only used row-column Bloom filters for scans that only requested one column. We have seen production queries that request up to 200 columns, and with say ~6 store files per store (region / column family combination) this might have resulted in 1200 block read operations in the worst case. With this diff we will be avoiding seeks on store files that we know don't contain the row/column of interest when using an ExplicitColumnTracker. The performance should remain the same for column range queries.

This addresses bug HBASE-2794.
    https://issues.apache.org/jira/browse/HBASE-2794

Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/KeyValue.java 585c4a8
  src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java f5173c4
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java a3d778e
  src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java PRE-CREATION
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 9d9895c
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java 6cdada7
  src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 4aa72de
  src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 68cdac5
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java fd9e7ef
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 08d3ba4
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java ac2348e
  src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 32f88fb
  src/test/java/org/apache/hadoop/hbase/regionserver/TestKeyValueHeap.java a5d13f7
  src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiColumnScanner.java baee696
  src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java PRE-CREATION

Diff: https://reviews.apache.org/r/2084/diff

Testing
-------

Existing unit tests. A new unit test (TestScanWithBloomError). Load testing using HBaseTest.

Thanks,

Mikhail
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116569#comment-13116569 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/
-----------------------------------------------------------

Review request for hbase.

Summary
-------

Previously we only used row-column Bloom filters for scans that only requested one column. We have seen production queries that request up to 200 columns, and with say ~6 store files per store (region / column family combination) this might have resulted in 1200 block read operations in the worst case. With this diff we will be avoiding seeks on store files that we know don't contain the row/column of interest when using an ExplicitColumnTracker. The performance should remain the same for column range queries.

This addresses bug HBASE-2794.
    https://issues.apache.org/jira/browse/HBASE-2794

Diffs
-----

  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 08d3ba4
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java ac2348e
  src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 4aa72de
  src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 68cdac5
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java fd9e7ef
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 9d9895c
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java 6cdada7
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98
  src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java PRE-CREATION
  src/main/java/org/apache/hadoop/hbase/KeyValue.java 585c4a8
  src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java f5173c4
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java a3d778e
  src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 32f88fb
  src/test/java/org/apache/hadoop/hbase/regionserver/TestKeyValueHeap.java a5d13f7
  src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiColumnScanner.java baee696
  src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java PRE-CREATION

Diff: https://reviews.apache.org/r/2084/diff

Testing
-------

Existing unit tests. A new unit test (TestScanWithBloomError). Load testing using HBaseTest.

Thanks,

Mikhail
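To make the worst-case figure in the review summary concrete, the small sketch below reproduces the arithmetic (200 columns times 6 store files) and counts how many seeks remain once files are only visited for keys they may contain. It is a toy model under stated assumptions: `SeekCountSketch` and its methods are hypothetical, and plain sets of "row/column" strings stand in for per-file Bloom filters, so there are no false positives.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical toy model of seek counts; not taken from the patch under review. */
final class SeekCountSketch {

    /** Worst case without filtering: one seek per requested column per store file. */
    static int worstCaseSeeks(int columns, int storeFiles) {
        return columns * storeFiles;
    }

    /**
     * Seeks that remain when a store file is only visited for row/column keys
     * that its (here: exact) filter does not rule out.
     */
    static int seeksWithFilter(List<Set<String>> fileKeys, String row, List<String> columns) {
        int seeks = 0;
        for (Set<String> keys : fileKeys) {
            for (String col : columns) {
                if (keys.contains(row + "/" + col)) {
                    seeks++;
                }
            }
        }
        return seeks;
    }
}
```

For the numbers quoted in the summary, `worstCaseSeeks(200, 6)` evaluates to 1200.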
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116578#comment-13116578 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2130
-----------------------------------------------------------

nice work mikhail! i will let someone else give the +1 though

src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2084/#comment4946

    method doesn't actually take a KeyValue... this is to create the last KV on the row and column for the KeyValue this is called on?

src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
https://reviews.apache.org/r/2084/#comment4947

    got it. maybe add a comment on this method to explain this usage

src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java
https://reviews.apache.org/r/2084/#comment4948

    license

- Jonathan
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116628#comment-13116628 ]

jirapos...@reviews.apache.org commented on HBASE-2794:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2137
-----------------------------------------------------------

This is an important feature.

Since the boolean parameter, forward, correlates so closely with reseek, can we give it a better name?
I was thinking about either reseek or forwardOnly.

- Ted
[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116689#comment-13116689 ]

Ted Yu commented on HBASE-2794:
-------------------------------

I got the following errors from the test suite:

{code}
Failed tests:
  testWorkerAbort(org.apache.hadoop.hbase.master.TestDistributedLogSplitting): expected:<1> but was:<0>

Tests in error:
  testMergeTool(org.apache.hadoop.hbase.util.TestMergeTool): String index out of range: -1
  testBasicRollingRestart(org.apache.hadoop.hbase.master.TestRollingRestart): test timed out after 30 milliseconds
{code}

They passed individually.
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888109#action_12888109 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kris Jirapinyo kjirapi...@attensity.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/
-----------------------------------------------------------

(Updated 2010-07-13 16:32:18.729301)

Review request for hbase.

Changes
-------

Added changes to code after HBASE-2265 was committed. Also incorporated the suggestion from Nicolas to skip the lookup when columns.size*error.rate > 10%. Changed the BloomFilter interface, adding getErrorRate(). ByteBloomFilter now also stores errorRate.

Summary
-------

HBASE-2794 Enable bloom filter checks for multiple columns in same column family

This addresses bug HBASE-2794.
    http://issues.apache.org/jira/browse/HBASE-2794

Diffs (updated)
-----

  /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 963862
  /trunk/src/main/java/org/apache/hadoop/hbase/util/BloomFilter.java 963873
  /trunk/src/main/java/org/apache/hadoop/hbase/util/ByteBloomFilter.java 963873
  /trunk/src/main/java/org/apache/hadoop/hbase/util/DynamicByteBloomFilter.java 963873
  /trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 963873

Diff: http://review.hbase.org/r/296/diff

Testing
-------

Ran and passed org.apache.hadoop.hbase.regionserver.TestStoreFile multiple times. Ran and passed all tests when building.

Thanks,

Kris

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888142#action_12888142 ]

Nicolas Spiegelberg commented on HBASE-2794:
--------------------------------------------

Talked with Kris about setting proper exit conditions.

#1 : Exit if our error.rate > 10%. This is an arbitrary number. Could easily make this configurable if someone needs it.
#2 : Exit if it would take > 1ms to run the bloom check. This ensures that blooms are beneficial for performance even if they aren't needed 90% of the time.

I wonder if it would be good to give the user an option of not running a bloom check if only 1 HFile in the StoreFile, but that's for another JIRA.
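Exit condition #1 can be sketched in a few lines of Java. This is only an illustration: the class and method names (BloomExitCondition, shouldSkipBloomCheck) are hypothetical and not part of HBase, and it assumes the per-lookup error rate is available as a double and that the combined error is estimated linearly as column count times error rate, as in the review discussion.

```java
/** Hypothetical helper illustrating exit condition #1; not HBase code. */
final class BloomExitCondition {
  /** Arbitrary cap on the combined error estimate (the 10% from the comment). */
  static final double MAX_COMBINED_ERROR = 0.10;

  /**
   * Returns true when the linear estimate columnCount * errorRate exceeds
   * the cap, i.e. when the bloom check should be skipped and the caller
   * should simply seek in the file.
   */
  static boolean shouldSkipBloomCheck(int columnCount, double errorRate) {
    return columnCount * errorRate > MAX_COMBINED_ERROR;
  }

  public static void main(String[] args) {
    // 5 columns at a 1% error rate: estimate 5%, bloom check still worthwhile.
    System.out.println(shouldSkipBloomCheck(5, 0.01));
    // 50 columns at a 1% error rate: estimate 50%, skip the bloom check.
    System.out.println(shouldSkipBloomCheck(50, 0.01));
  }
}
```

Condition #2 (a time budget for the check) is not modeled here because it depends on measurements that the thread had not yet produced.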
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888144#action_12888144 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Nicolas nspiegelb...@facebook.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review397
-----------------------------------------------------------

Looking good! Waiting for performance test numbers on StoreFile.shouldSeek(). I think we want to early exit if shouldSeek() would take > 1ms or something sensible.

/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1703

    red = using tabs instead of spaces or trailing spaces. quick fix might be nice (or is this auto-handled by svn, Stack?)

/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
http://review.hbase.org/r/296/#comment1702

    could you add test header comments so we know all the cases you're trying to test?

- Nicolas
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888166#action_12888166 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Jonathan Gray jg...@apache.org

bq. On 2010-07-13 18:09:13, Nicolas wrote:
bq. /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 958
bq. http://review.hbase.org/r/296/diff/3/?file=2723#file2723line958
bq.
bq. red = using tabs instead of spaces or trailing spaces. quick fix might be nice (or is this auto-handled by svn, Stack?)

none of this is auto-handled by svn. need to setup eclipse or whatever you use to use 2 spaces instead of tabs. and in eclipse, i have my code cleanup set to remove whitespace and run that periodically.

- Jonathan

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review397
-----------------------------------------------------------
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887277#action_12887277 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kris Jirapinyo kjirapi...@attensity.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/
-----------------------------------------------------------

Review request for hbase.

Summary
-------

HBASE-2794 Enable bloom filter checks for multiple columns in same column family

This addresses bug HBASE-2794.
    http://issues.apache.org/jira/browse/HBASE-2794

Diffs
-----

  /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 962748
  /trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 962748

Diff: http://review.hbase.org/r/296/diff

Testing
-------

Ran and passed org.apache.hadoop.hbase.regionserver.TestStoreFile multiple times. Ran and passed all tests when building.

Thanks,

Kris

ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
    Attachments: 2794_multi_column_check.txt
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887433#action_12887433 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Nicolas nspiegelb...@facebook.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review350
-----------------------------------------------------------

/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1468

    have you done any tests to see when the number of bloom checks takes significant time compared to just getting the block? For example, if you have 100 columns to lookup, do bloom filters really buy you anything, or shouldn't you just switch to a Row-level bloom anyways? Also, with a default 1% error rate, you're looking at ~100% false positive with 100 columns. Maybe max.columns = sqrt(1/error.rate)

/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1463

    probably should pre-allocate the ArrayList() size so we only deal with one heap element.

- Nicolas
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887443#action_12887443 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kris Jirapinyo kjirapi...@attensity.com

bq. On 2010-07-12 10:17:25, Nicolas wrote:
bq. /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 860
bq. http://review.hbase.org/r/296/diff/1/?file=2378#file2378line860
bq.
bq. probably should pre-allocate the ArrayList() size so we only deal with one heap element.

Good idea.

bq. On 2010-07-12 10:17:25, Nicolas wrote:
bq. /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 857
bq. http://review.hbase.org/r/296/diff/1/?file=2378#file2378line857
bq.
bq. have you done any tests to see when the number of bloom checks takes significant time compared to just getting the block? For example, if you have 100 columns to lookup, do bloom filters really buy you anything, or shouldn't you just switch to a Row-level bloom anyways? Also, with a default 1% error rate, you're looking at ~100% false positive with 100 columns. Maybe max.columns = sqrt(1/error.rate)

I have not, but would running on just the test data be sufficient to tell the true savings, since the tests just run on mock data? I don't really have a dev cluster with real data that I can test this on, so perhaps you or someone could help out in that regard.

- Kris

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review350
-----------------------------------------------------------
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887501#action_12887501 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kannan Muthukkaruppan kan...@facebook.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review361
-----------------------------------------------------------

/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1497

    can't this loop be over columns itself? And then inside the loop, you prepare one key at a time using Bytes.add(row, col). That way, you can avoid the keyList data structure completely.

- Kannan
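Kannan's suggestion, looping over the columns and building one row+column key at a time, could look roughly like the sketch below. It is not the patch under review: to keep it self-contained, the bloom filter is replaced by a plain java.util.function.Predicate, Bytes.add(row, col) is replaced by a local concat helper, and the class name MultiColumnBloomSketch is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of a per-column ROWCOL check; not the committed patch. */
final class MultiColumnBloomSketch {
  /** Stand-in for Bytes.add(row, col): row bytes followed by column bytes. */
  static byte[] concat(byte[] row, byte[] col) {
    byte[] key = Arrays.copyOf(row, row.length + col.length);
    System.arraycopy(col, 0, key, row.length, col.length);
    return key;
  }

  /**
   * Returns false only when the filter rejects every row+column key, so the
   * file can be skipped. Stops at the first key the filter might contain.
   * With no explicit columns the filter cannot help, so it returns true.
   */
  static boolean shouldSeek(byte[] row, Collection<byte[]> columns,
                            Predicate<byte[]> mightContain) {
    if (columns == null || columns.isEmpty()) {
      return true;
    }
    for (byte[] col : columns) {
      if (mightContain.test(concat(row, col))) {
        return true; // no intermediate keyList is needed
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // An exact set plays the role of the bloom filter in this demo.
    Set<String> stored = new HashSet<>(Arrays.asList("r1c2"));
    Predicate<byte[]> filter =
        k -> stored.contains(new String(k, StandardCharsets.UTF_8));
    byte[] row = "r1".getBytes(StandardCharsets.UTF_8);
    List<byte[]> wanted = Arrays.asList(
        "c1".getBytes(StandardCharsets.UTF_8),
        "c2".getBytes(StandardCharsets.UTF_8));
    System.out.println(shouldSeek(row, wanted, filter));
  }
}
```

Because each key is built and tested inside the loop, the only allocation per column is the key itself, which is the point of dropping keyList.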
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887507#action_12887507 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kris Jirapinyo kjirapi...@attensity.com

bq. On 2010-07-12 13:14:32, Kannan Muthukkaruppan wrote:
bq. /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 880
bq. http://review.hbase.org/r/296/diff/1/?file=2378#file2378line880
bq.
bq. can't this loop be over columns itself? And then inside the loop, you prepare one key at a time use Bytes.add(row, col). That way, you can avoid the keyList data structure completely.

Another good idea :) Will also get rid of the warning that keyList could possibly be null.

- Kris

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review361
-----------------------------------------------------------
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887511#action_12887511 ]

ryan rawson commented on HBASE-2794:
------------------------------------

Consider a table with 12 billion rows. At 9 bits/row, we are looking at 13,500,000,000 bytes of ram (base) to store the blooms. That is 12.57 GB ram to store the blooms.

The memory competes with the block cache, thus you are losing 12.57 GB ram that could be used to cache blocks. If your data is in block cache, seeking is free, thus there is an essential trade off here.

In my case, the 12b rows are small ones, and thus we have a lot of rows for the actual data size. On a different dataset, the row count might be smaller for the actual data size and it might be worthwhile.

Furthermore, blooms don't work on Scans, only on Gets.

The key takeaway here is that (a) bloom filters are not free and potentially very expensive in terms of RAM, (b) bloom data competes with the block cache, and (c) the trade off depends on the data set and access patterns.

On Mon, Jul 12, 2010 at 12:07 PM, HBase Review Board (JIRA)
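The memory figure in this comment follows from simple arithmetic: 12 billion entries at 9 bits each is 108 billion bits, or 13.5 billion bytes, which is about 12.57 when divided by 2^30. A small Java check, with a hypothetical class name (BloomMemoryEstimate) and ignoring any per-filter overhead:

```java
/** Hypothetical calculator for raw bloom filter size; ignores metadata overhead. */
final class BloomMemoryEstimate {
  /** Raw bytes for a bloom holding {@code entries} items at {@code bitsPerEntry} bits each. */
  static long bloomBytes(long entries, int bitsPerEntry) {
    return entries * bitsPerEntry / 8;
  }

  /** Converts a byte count to binary gigabytes (2^30 bytes). */
  static double toGiB(long bytes) {
    return bytes / (double) (1L << 30);
  }

  public static void main(String[] args) {
    long bytes = bloomBytes(12_000_000_000L, 9);
    System.out.println(bytes + " bytes, about " + toGiB(bytes) + " GiB");
  }
}
```

The same helper can be used to size blooms for other row counts, for example when deciding how much block cache a given table would give up.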
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887537#action_12887537 ]

Nicolas Spiegelberg commented on HBASE-2794:
--------------------------------------------

IRC conversation about this...

krispyjala: nspiegelberg: but is the test you want related to HBASE-2794 or just using bloom filter in general (e.g. when to use it and when not to)?
[1:41pm] nspiegelberg: it's related to 2794
[1:42pm] nspiegelberg: an easy example of why you need good measurements is the case of calling bloom.contains() for 100 row+col in a 1% false positive bloom. You are getting almost 100% false positives then, so the bloom is an obvious perf drop
[1:43pm] krispyjala: nspiegelberg: ok i think i understand
[1:44pm] krispyjala: nspiegelberg: but wait 100% false positive?
[1:46pm] nspiegelberg: right, so io.hfile.bloom.error.rate == .01 by default. so 1%
[1:46pm] krispyjala: ok
[1:46pm] krispyjala: how does that add up to 100% for 100 lookups?
[1:46pm] nspiegelberg: therefore, if you call bloom.contains() 5 times and OR the result, the false positive rate is 5%
[1:49pm] nspiegelberg: krispyjala: so a simple example. call bloom.contains() 10 times = 10% error rate = (10ms/seek * 10%) + time(bloom.contains)
[1:50pm] krispyjala: nspiegelberg: but is it really OR'ing all of them? In the code if even one column lookup returns true we return true and don't look up any other columns
[1:51pm] nspiegelberg: right, that's the same thing as ORing them
[1:51pm] nspiegelberg: logical OR = ||
[1:52pm] krispyjala: nspiegelberg: but the point is we're probably not looking up 100 columns every time for that operation, even theoretically yes we do a logical OR
[1:52pm] krispyjala: if we hit true on the 5th column, we quit the loop and return right away
[1:53pm] nspiegelberg: the only way you win with blooms is if all bloom.contains() return false and you don't have to do the lookup
[1:53pm] krispyjala: yes
[1:53pm] nspiegelberg: so, you're right, we do an average of 50 lookups per false positive in this case.
[1:54pm] nspiegelberg: I'm just saying, what is the cost of those 50 lookups? If 1ms, then every HFile seek costs 11ms with blooms enabled versus 10 ms without using them
[1:55pm] krispyjala: but wait i thought the code was to determine whether to add the StoreScanner to the list or not...or are you saying then that the point is in the case of 100 columns we should just not even bother doing bloom multicolumn check because perhaps it's better to just load it than wasting time with the 100 lookups (potentially)
[1:55pm] nspiegelberg: exactly
[1:55pm] krispyjala: nspiegelberg: lol ok got it
[1:56pm] krispyjala: but realistically, who does gets on 100 columns? I don't know the HBase internals well yet (that's why i picked the noob ticket lol)...wouldn't it be better to just do a get on the row?
[1:57pm] nspiegelberg: never under-estimate the naivete of users
[1:57pm] krispyjala: nspiegelberg: sigh lol, i guess that's why the bloom is off by default?
[1:58pm] nspiegelberg: yes
[1:58pm] nspiegelberg: so, it's obvious that you never want to run bloom code with 101 columns + 1% error rate
[1:58pm] krispyjala: correct
[1:59pm] nspiegelberg: so, really it's just timing testBloomPerf with various lookup counts on various size blooms
[2:00pm] krispyjala: nspiegelberg: this talk has helped me think about how to test like you said
[2:00pm] • St^Ack hopes the above good-stuff(tm) 'lesson' makes it back into the issue
[2:00pm] nspiegelberg: looks like ryan didn't give you any concrete numbers, so you might have to just start with some assumptions (like, don't use blooms if avg key 1KB) and run with that
[2:01pm] krispyjala: nspiegelberg: and perhaps once we kind of know where the tradeoff is, would it be wrong to limit in the code saying if there are more than say 10 column lookups might as well just return true?
[2:01pm] krispyjala: cuz it's not worth looking up in bloom at that point
[2:01pm] nspiegelberg: I think that's exactly what we need to do
[2:01pm] krispyjala: whatever the threshold is
[2:02pm] nspiegelberg: if we pretend that the cost of bloom.contains() == 0, then maybe we want to say if (column.count * error.rate > 10%) return true;
[2:02pm] dj_ryan: well it's hard to say where the tradeoff goes
[2:02pm] krispyjala: pastebin? lol jk
[2:02pm] dj_ryan: but the hard number is 9 bits/item
[2:03pm] dj_ryan: you can then calculate how much ram you are spending on blooms
[2:03pm] dj_ryan: and decide if its worth it
[2:03pm] nspiegelberg: the hard # for 1% error rate blooms is 9 bits/item
[2:03pm] dj_ryan: we never implemented blooms because it seemed 12gb of ram would be better off caching
[2:03pm] krispyjala: dj_ryan: so your suggestion the onus is on the user and not hbase code
[2:03pm] nspiegelberg: with .1% error rate, it's ~12 bits/item
[2:04pm] krispyjala:
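The conversation estimates the combined false-positive rate linearly, as lookups times error rate. As a side note, if the lookups are assumed to be independent, the chance that at least one of n lookups gives a false positive is 1 - (1 - p)^n, which the linear figure approximates for small n and overstates for large n. The Java sketch below compares the two; the class name CombinedFalsePositive is hypothetical and the independence assumption is mine, not something stated in the thread.

```java
/** Hypothetical comparison of linear vs. independent-lookup estimates; not HBase code. */
final class CombinedFalsePositive {
  /** Probability that at least one of {@code lookups} independent checks is a false positive. */
  static double combinedRate(double errorRate, int lookups) {
    return 1.0 - Math.pow(1.0 - errorRate, lookups);
  }

  /** Linear estimate used in the conversation, capped at 100%. */
  static double linearEstimate(double errorRate, int lookups) {
    return Math.min(1.0, errorRate * lookups);
  }

  public static void main(String[] args) {
    for (int n : new int[] {5, 10, 100}) {
      System.out.println(n + " lookups: linear=" + linearEstimate(0.01, n)
          + " independent=" + combinedRate(0.01, n));
    }
  }
}
```

Under this assumption, 5 lookups at 1% give roughly 4.9% and 100 lookups give roughly 63%, so the conclusion of the conversation (many lookups make the bloom check unprofitable) still holds, although the exact cutoff moves.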
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887638#action_12887638 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kris Jirapinyo kjirapi...@attensity.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/
-----------------------------------------------------------

(Updated 2010-07-12 19:48:43.373418)

Review request for hbase.

Changes
-------

Implemented Kannan's suggestion, thereby removing keyList.

Summary
-------

HBASE-2794 Enable bloom filter checks for multiple columns in same column family

This addresses bug HBASE-2794.
    http://issues.apache.org/jira/browse/HBASE-2794

Diffs (updated)
-----

  /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 962748
  /trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 962748

Diff: http://review.hbase.org/r/296/diff

Testing
-------

Ran and passed org.apache.hadoop.hbase.regionserver.TestStoreFile multiple times. Ran and passed all tests when building.

Thanks,

Kris
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887659#action_12887659 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: Kannan Muthukkaruppan kan...@facebook.com

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review384
-----------------------------------------------------------

One inlined comment. Otherwise, the patch and the test look good.

/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
http://review.hbase.org/r/296/#comment1622

    Once Pranav's patch for HBase-2265 lands, the shouldSeek() API will take a Scan as the first argument instead of the row. So, you might need to rebase the test with respect to that patch.

- Kannan
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887274#action_12887274 ]

ryan rawson commented on HBASE-2794:
------------------------------------

can you also upload it to review.hbase.org for easy reviewing, thanks :-)

ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
    Attachments: 2794_multi_column_check.txt
[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882820#action_12882820 ]

Kannan Muthukkaruppan commented on HBASE-2794:
----------------------------------------------

Perhaps a simple starter task for someone interested.