[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-30 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118247#comment-13118247
 ] 

Ted Yu commented on HBASE-2794:
---

TestServerCustomProtocol#testRowRange failed during test suite run but passed 
standalone.

 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
 Issue Type: Improvement
 Components: performance
 Reporter: Kannan Muthukkaruppan
 Assignee: Mikhail Bautin
 Fix For: 0.92.0


 Noticed the following snippet in StoreFile.java:Scanner:shouldSeek():
 {code}
 switch (bloomFilterType) {
   case ROW:
     key = row;
     break;
   case ROWCOL:
     if (columns.size() == 1) {
       byte[] col = columns.first();
       key = Bytes.add(row, col);
       break;
     }
     //$FALL-THROUGH$
   default:
     return true;
 }
 {code}
 If columns.size() > 1, then we currently don't take advantage of the bloom 
 filter. We should optimize this to check the bloom filter for each of the 
 columns and, if none of the columns is present in it, avoid opening the file.
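
The per-column check proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HBase code or the eventual patch: RowColFilter, rowColKey, exactFilter and shouldSeek are hypothetical names, and an exact set stands in for the real Bloom filter so that the demo is deterministic (a real Bloom filter may also answer "maybe present" for absent keys).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MultiColumnBloomSketch {

    /** Hypothetical stand-in for a ROWCOL Bloom filter: false positives allowed, false negatives not. */
    interface RowColFilter {
        boolean mightContain(byte[] rowColKey);
    }

    /** Concatenates row and column qualifier, mirroring Bytes.add(row, col) in the quoted snippet. */
    static byte[] rowColKey(byte[] row, byte[] col) {
        byte[] key = Arrays.copyOf(row, row.length + col.length);
        System.arraycopy(col, 0, key, row.length, col.length);
        return key;
    }

    /** Returns false only when the filter rules out every requested column for this row. */
    static boolean shouldSeek(RowColFilter bloom, byte[] row, List<byte[]> columns) {
        if (columns.isEmpty()) {
            return true; // no explicit columns requested: the filter cannot help
        }
        for (byte[] col : columns) {
            if (bloom.mightContain(rowColKey(row, col))) {
                return true; // at least one column may be in the file, so it must be opened
            }
        }
        return false; // none of the columns can be present: skip the file
    }

    /** Exact-set filter used only to keep this demo deterministic (assumption, not a Bloom filter). */
    static RowColFilter exactFilter(byte[]... keys) {
        Set<String> present = new HashSet<>();
        for (byte[] k : keys) {
            present.add(new String(k, StandardCharsets.ISO_8859_1));
        }
        return key -> present.contains(new String(key, StandardCharsets.ISO_8859_1));
    }

    static byte[] b(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        RowColFilter bloom = exactFilter(rowColKey(b("row1"), b("c2")));
        // One of the two requested columns is present, so the file has to be read.
        System.out.println(shouldSeek(bloom, b("row1"), Arrays.asList(b("c1"), b("c2"))));
        // Neither requested column is present, so the file can be skipped.
        System.out.println(shouldSeek(bloom, b("row1"), Arrays.asList(b("c1"), b("c3"))));
    }
}
```

With a real Bloom filter the second call could still return true on a false positive; that costs an unnecessary read but never a wrong result.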

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117517#comment-13117517
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2161
---



src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2084/#comment5035

I was implying that this is also a method argument when I wrote this 
comment. I will edit this to make it clearer.



src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
https://reviews.apache.org/r/2084/#comment5036

Yes, I will modify the javadoc of this method.


- Mikhail


On 2011-09-28 16:03:52, Mikhail Bautin wrote:
bq.  
bq.  ---
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2084/
bq.  ---
bq.  
bq.  (Updated 2011-09-28 16:03:52)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  ---
bq.  
bq.  Previously we only used row-column Bloom filters for scans that only requested one column. We have seen production queries that request up to 200 columns, and with say ~6 store files per store (region / column family combination) this might have resulted in 1200 block read operations in the worst case. With this diff we will be avoiding seeks on store files that we know don't contain the row/column of interest when using an ExplicitColumnTracker. The performance should remain the same for column range queries.
bq.  
bq.  
bq.  This addresses bug HBASE-2794.
bq.  https://issues.apache.org/jira/browse/HBASE-2794
bq.  
bq.  
bq.  Diffs
bq.  -
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 08d3ba4 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java ac2348e 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 4aa72de 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 68cdac5 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java fd9e7ef 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 9d9895c 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java 6cdada7 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java PRE-CREATION 
bq.    src/main/java/org/apache/hadoop/hbase/KeyValue.java 585c4a8 
bq.    src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java f5173c4 
bq.    src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java a3d778e 
bq.    src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 32f88fb 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/TestKeyValueHeap.java a5d13f7 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiColumnScanner.java baee696 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/2084/diff
bq.  
bq.  
bq.  Testing
bq.  ---
bq.  
bq.  Existing unit tests. A new unit test (TestScanWithBloomError). Load testing using HBaseTest.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Mikhail
bq.  
bq.




[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117523#comment-13117523
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--



bq.  On 2011-09-28 17:42:46, Ted Yu wrote:
bq.   This is an important feature.
bq.   
bq.   Since the boolean parameter, forward, correlates so closely with reseek, can we give it a better name?
bq.   I was thinking about either reseek or forwardOnly.

We have a few diffs in the pipeline that depend on this one. Can we rename the 
boolean flag after we commit those diffs?


- Mikhail


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2137
---



[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117524#comment-13117524
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--



bq.  On 2011-09-28 17:42:46, Ted Yu wrote:
bq.   This is an important feature.
bq.   
bq.   Since the boolean parameter, forward, correlates so closely with reseek, can we give it a better name?
bq.   I was thinking about either reseek or forwardOnly.
bq.  
bq.  Mikhail Bautin wrote:
bq.  We have a few diffs in the pipeline that depend on this one. Can we rename the boolean flag after we commit those diffs?

I am fine with the current name of forward.


- Ted


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2137
---



[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117615#comment-13117615
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/
---

(Updated 2011-09-29 21:05:20.334849)


Review request for hbase.


Changes
---

Addressing Jonathan's comments.


Summary
---

Previously we only used row-column Bloom filters for scans that only requested 
one column. We have seen production queries that request up to 200 columns, and 
with say ~6 store files per store (region / column family combination) this 
might have resulted in 1200 block read operations in the worst case. With this 
diff we will be avoiding seeks on store files that we know don't contain the 
row/column of interest when using an ExplicitColumnTracker. The performance 
should remain the same for column range queries.
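
To make the worst-case figure in the summary concrete, here is a small sketch of the arithmetic and of the effect of skipping store files that cannot contain a row/column pair. It is an illustration only: SeekCountSketch and its methods are hypothetical names, and exact sets stand in for the per-file row-column Bloom filters, so no false positives are modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SeekCountSketch {

    /** Worst case without pruning: one block read per requested column per store file. */
    static int worstCaseReads(int requestedColumns, int storeFiles) {
        return requestedColumns * storeFiles;
    }

    /**
     * Reads that remain when a per-file membership check (an exact set here, standing in
     * for a row-column Bloom filter) rules out row/column pairs that are absent.
     */
    static int readsWithPruning(List<Set<String>> storeFiles, String row, List<String> columns) {
        int reads = 0;
        for (Set<String> file : storeFiles) {
            for (String col : columns) {
                if (file.contains(row + "/" + col)) {
                    reads++; // seek only where the pair may be present
                }
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        // 200 columns x 6 store files, the figures quoted in the summary.
        System.out.println(worstCaseReads(200, 6));

        List<Set<String>> files = new ArrayList<>();
        files.add(new HashSet<>(Arrays.asList("r1/a", "r1/b")));
        files.add(new HashSet<>(Arrays.asList("r1/c")));
        files.add(new HashSet<>(Arrays.asList("r2/a")));
        // Unpruned this would be 3 columns x 3 files = 9 reads; only 3 pairs are present.
        System.out.println(readsWithPruning(files, "r1", Arrays.asList("a", "b", "c")));
    }
}
```

The first line printed is the 1200 quoted above; the second shows how the count drops to the number of pairs that are actually present once absent pairs are filtered out.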


This addresses bug HBASE-2794.
https://issues.apache.org/jira/browse/HBASE-2794


Diffs (updated)
-

  src/main/java/org/apache/hadoop/hbase/KeyValue.java 585c4a8 
  src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java f5173c4 
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java a3d778e 
  src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java PRE-CREATION 
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98 
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 9d9895c 
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java 6cdada7 
  src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 4aa72de 
  src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 68cdac5 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java fd9e7ef 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 08d3ba4 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java ac2348e 
  src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 32f88fb 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestKeyValueHeap.java a5d13f7 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiColumnScanner.java baee696 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java PRE-CREATION 

Diff: https://reviews.apache.org/r/2084/diff


Testing
---

Existing unit tests. A new unit test (TestScanWithBloomError). Load testing 
using HBaseTest.


Thanks,

Mikhail







[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-28 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116569#comment-13116569
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/
---

Review request for hbase.


Summary
---

Previously we only used row-column Bloom filters for scans that only requested 
one column. We have seen production queries that request up to 200 columns, and 
with say ~6 store files per store (region / column family combination) this 
might have resulted in 1200 block read operations in the worst case. With this 
diff we will be avoiding seeks on store files that we know don't contain the 
row/column of interest when using an ExplicitColumnTracker. The performance 
should remain the same for column range queries.


This addresses bug HBASE-2794.
https://issues.apache.org/jira/browse/HBASE-2794


Diffs
-

  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java 08d3ba4 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java ac2348e 
  src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java 4aa72de 
  src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 68cdac5 
  src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java fd9e7ef 
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java 9d9895c 
  src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueScanner.java 6cdada7 
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98 
  src/main/java/org/apache/hadoop/hbase/regionserver/AbstractKeyValueScanner.java PRE-CREATION 
  src/main/java/org/apache/hadoop/hbase/KeyValue.java 585c4a8 
  src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java f5173c4 
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java a3d778e 
  src/main/java/org/apache/hadoop/hbase/util/CollectionBackedScanner.java 32f88fb 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestKeyValueHeap.java a5d13f7 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestMultiColumnScanner.java baee696 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java PRE-CREATION 

Diff: https://reviews.apache.org/r/2084/diff


Testing
---

Existing unit tests. A new unit test (TestScanWithBloomError). Load testing 
using HBaseTest.


Thanks,

Mikhail







[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-28 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116578#comment-13116578
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2130
---


nice work mikhail!  i will let someone else give the +1 though


src/main/java/org/apache/hadoop/hbase/KeyValue.java
https://reviews.apache.org/r/2084/#comment4946

method doesn't actually take a KeyValue... this is to create the last KV 
on the row and column for the KeyValue this is called on?



src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
https://reviews.apache.org/r/2084/#comment4947

got it.  maybe add a comment on this method to explain this usage



src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java
https://reviews.apache.org/r/2084/#comment4948

license


- Jonathan



[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-28 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116628#comment-13116628
 ] 

jirapos...@reviews.apache.org commented on HBASE-2794:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2084/#review2137
---


This is an important feature.

Since the boolean parameter, forward, correlates so closely with reseek, can we 
give it a better name?
I was thinking about either reseek or forwardOnly.

- Ted






[jira] [Commented] (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2011-09-28 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116689#comment-13116689
 ] 

Ted Yu commented on HBASE-2794:
---

I got the following errors from test suite:
{code}
Failed tests:
  testWorkerAbort(org.apache.hadoop.hbase.master.TestDistributedLogSplitting): expected:<1> but was:<0>

Tests in error:
  testMergeTool(org.apache.hadoop.hbase.util.TestMergeTool): String index out of range: -1
  testBasicRollingRestart(org.apache.hadoop.hbase.master.TestRollingRestart): test timed out after 30 milliseconds
{code}
They passed individually.

 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
  Components: performance
Reporter: Kannan Muthukkaruppan
 Fix For: 0.92.0






[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-13 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888109#action_12888109
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kris Jirapinyo kjirapi...@attensity.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/
---

(Updated 2010-07-13 16:32:18.729301)


Review request for hbase.


Changes
---

Added changes to code after HBASE-2265 was committed.

Also, incorporated the suggestion from Nicolas to skip the lookup when 
columns.size * error.rate > 10%.

Changed BloomFilter interface, adding getErrorRate().  ByteBloomFilter now also 
has errorRate stored.
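
The cut-off described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the class name, method name, and the constant are hypothetical, and the error rate is passed in as a plain double rather than read from the getErrorRate() method mentioned in the change list.

```java
/** Hypothetical helper; names are illustrative and not taken from the HBase patch. */
final class BloomLookupGuard {
    /** Arbitrary budget from the review discussion: 10% combined false-positive rate. */
    static final double MAX_COMBINED_ERROR = 0.10;

    /**
     * Returns true when the per-column bloom checks should be skipped because
     * columnCount * errorRate (an upper bound on the combined false-positive
     * rate) exceeds the budget. The caller would then simply seek the file.
     */
    static boolean skipBloomCheck(int columnCount, double errorRate) {
        return columnCount * errorRate > MAX_COMBINED_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(skipBloomCheck(5, 0.01));   // 0.05 is under the budget
        System.out.println(skipBloomCheck(11, 0.01));  // 0.11 is over the budget
    }
}
```

With the default 1% error rate, such a guard would keep the bloom check for small column counts and bypass it once the product passes 10%.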


Summary
---

HBASE-2794 Enable bloom filter checks for multiple columns in same column family


This addresses bug HBASE-2794.
http://issues.apache.org/jira/browse/HBASE-2794


Diffs (updated)
-

  /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 
963862 
  /trunk/src/main/java/org/apache/hadoop/hbase/util/BloomFilter.java 963873 
  /trunk/src/main/java/org/apache/hadoop/hbase/util/ByteBloomFilter.java 963873 
  /trunk/src/main/java/org/apache/hadoop/hbase/util/DynamicByteBloomFilter.java 
963873 
  /trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 
963873 

Diff: http://review.hbase.org/r/296/diff


Testing
---

Ran and passed org.apache.hadoop.hbase.regionserver.TestStoreFile multiple 
times.  Ran and passed all tests when building.


Thanks,

Kris




 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-13 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888142#action_12888142
 ] 

Nicolas Spiegelberg commented on HBASE-2794:


Talked with Kris about setting proper exit conditions.

#1 : Exit if our error.rate > 10%.  This is an arbitrary number.  Could easily 
make this configurable if someone needs it.
#2 : Exit if it would take > 1ms to run the bloom check.  This ensures that 
blooms are beneficial for performance even if they aren't needed 90% of the time.

I wonder if it would be good to give the user an option of not running a bloom 
check if only 1 HFile in the StoreFile, but that's for another JIRA.

 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan




[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-13 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888144#action_12888144
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Nicolas nspiegelb...@facebook.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review397
---


Looking good!  Waiting for performance test numbers on StoreFile.shouldSeek().  
I think we want to early exit if shouldSeek() would take > 1ms or something 
sensible.


/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1703

red = using tabs instead of spaces or trailing spaces.  quick fix might be 
nice (or is this auto-handled by svn, Stack?)



/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
http://review.hbase.org/r/296/#comment1702

could you add test header comments so we know all the cases you're trying 
to test?


- Nicolas





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan




[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-13 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888166#action_12888166
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Jonathan Gray jg...@apache.org


bq.  On 2010-07-13 18:09:13, Nicolas wrote:
bq.   
/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 
958
bq.   http://review.hbase.org/r/296/diff/3/?file=2723#file2723line958
bq.  
bq.   red = using tabs instead of spaces or trailing spaces.  quick fix 
might be nice (or is this auto-handled by svn, Stack?)

none of this is auto-handled by svn.  need to setup eclipse or whatever you use 
to use 2 spaces instead of tabs.  and in eclipse, i have my code cleanup set to 
remove whitespace and run that periodically.


- Jonathan


---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review397
---





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan




[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887277#action_12887277
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kris Jirapinyo kjirapi...@attensity.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/
---

Review request for hbase.


Summary
---

HBASE-2794 Enable bloom filter checks for multiple columns in same column family


This addresses bug HBASE-2794.
http://issues.apache.org/jira/browse/HBASE-2794


Diffs
-

  /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 
962748 
  /trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 
962748 

Diff: http://review.hbase.org/r/296/diff


Testing
---

Ran and passed org.apache.hadoop.hbase.regionserver.TestStoreFile multiple 
times.  Ran and passed all tests when building.


Thanks,

Kris




 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
 Attachments: 2794_multi_column_check.txt





[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887433#action_12887433
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Nicolas nspiegelb...@facebook.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review350
---



/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1468

have you done any tests to see when the number of bloom checks takes 
significant time compared to just getting the block?  For example, if you have 
100 columns to lookup, do bloom filters really buy you anything, or shouldn't 
you just switch to a Row-level bloom anyways?  Also, with a default 1% error 
rate, you're looking at ~100% false positive with 100 columns.  Maybe 
max.columns = sqrt(1/error.rate)
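
To put rough numbers on the comment above, here is a small sketch. It assumes each bloom lookup is an independent trial with the configured error rate, which is a simplification. Under that assumption the "~100%" figure is the n * p upper bound; the exact chance of at least one false positive in 100 lookups at 1% works out to about 63%. The class and method names are made up for illustration.

```java
/** Hypothetical calculator for the combined false-positive rate of several bloom lookups. */
final class CompoundFalsePositive {
    /** Chance that at least one of n lookups is a false positive, assuming independence. */
    static double exact(int lookups, double errorRate) {
        return 1.0 - Math.pow(1.0 - errorRate, lookups);
    }

    /** Simple upper bound n * p, capped at 1 (the estimate used in the review comment). */
    static double upperBound(int lookups, double errorRate) {
        return Math.min(1.0, lookups * errorRate);
    }

    /** The sqrt(1/error.rate) heuristic floated in the review comment. */
    static int maxColumns(double errorRate) {
        return (int) Math.floor(Math.sqrt(1.0 / errorRate));
    }

    public static void main(String[] args) {
        double p = 0.01;
        for (int n : new int[] {1, 5, 10, 100}) {
            System.out.printf("n=%d exact=%.3f bound=%.3f%n", n, exact(n, p), upperBound(n, p));
        }
        System.out.println("max columns at 1%: " + maxColumns(p));
    }
}
```

Either way the conclusion in the comment holds: with many columns the bloom rarely rules a file out, so the lookups are mostly wasted work.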



/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1463

probably should pre-allocate the ArrayList() size so we only deal with one 
heap element.


- Nicolas





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
 Attachments: 2794_multi_column_check.txt





[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887443#action_12887443
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kris Jirapinyo kjirapi...@attensity.com


bq.  On 2010-07-12 10:17:25, Nicolas wrote:
bq.   
/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 
860
bq.   http://review.hbase.org/r/296/diff/1/?file=2378#file2378line860
bq.  
bq.   probably should pre-allocate the ArrayList() size so we only deal 
with one heap element.

Good idea.


bq.  On 2010-07-12 10:17:25, Nicolas wrote:
bq.   
/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 
857
bq.   http://review.hbase.org/r/296/diff/1/?file=2378#file2378line857
bq.  
bq.   have you done any tests to see when the number of bloom checks takes 
significant time compared to just getting the block?  For example, if you have 
100 columns to lookup, do bloom filters really buy you anything, or shouldn't 
you just switch to a Row-level bloom anyways?  Also, with a default 1% error 
rate, you're looking at ~100% false positive with 100 columns.  Maybe 
max.columns = sqrt(1/error.rate)

I have not, but would running on just the test data be sufficient to tell the 
true savings, since the tests just run on mock data?  I don't really have a dev 
cluster with real data that I can test this on, so perhaps you or someone could 
help out in that regard.


- Kris


---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review350
---





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
 Attachments: 2794_multi_column_check.txt





[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887501#action_12887501
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kannan Muthukkaruppan kan...@facebook.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review361
---



/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
http://review.hbase.org/r/296/#comment1497

can't this loop be over columns itself? And then inside the loop, you 
prepare one key at a time using Bytes.add(row, col). That way, you can avoid the 
keyList data structure completely.
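
A minimal sketch of the loop suggested above, written so it runs on its own. It is not the patch itself: KeyFilter is a hypothetical stand-in for the bloom filter, concat() plays the role of Bytes.add(row, col), and returning true for an empty column list is an assumption made here so that a request without explicit columns never rules a file out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; not the StoreFile code from the review. */
final class MultiColumnBloomCheck {
    /** Hypothetical stand-in for a bloom filter lookup. */
    interface KeyFilter {
        boolean mightContain(byte[] key);
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Concatenates row and column, analogous to Bytes.add(row, col) in the quoted snippet. */
    static byte[] concat(byte[] row, byte[] col) {
        byte[] key = Arrays.copyOf(row, row.length + col.length);
        System.arraycopy(col, 0, key, row.length, col.length);
        return key;
    }

    /** Builds one key per column inside the loop, so no intermediate key list is needed. */
    static boolean shouldSeek(KeyFilter bloom, byte[] row, List<byte[]> columns) {
        if (columns.isEmpty()) {
            return true; // assumption: no explicit columns, so the file cannot be ruled out
        }
        for (byte[] col : columns) {
            if (bloom.mightContain(concat(row, col))) {
                return true; // first possible hit is enough; stop checking
            }
        }
        return false; // every row+col key was rejected, the file can be skipped
    }

    public static void main(String[] args) {
        KeyFilter fake = key -> Arrays.equals(key, bytes("r1c2"));
        System.out.println(shouldSeek(fake, bytes("r1"), Arrays.asList(bytes("c1"), bytes("c2"))));
        System.out.println(shouldSeek(fake, bytes("r1"), Arrays.asList(bytes("c1"), bytes("c3"))));
    }
}
```

The early return on the first possible hit is the same short-circuit behaviour discussed elsewhere in this thread as "ORing" the lookups.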


- Kannan





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
 Attachments: 2794_multi_column_check.txt





[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887507#action_12887507
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kris Jirapinyo kjirapi...@attensity.com


bq.  On 2010-07-12 13:14:32, Kannan Muthukkaruppan wrote:
bq.   
/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 
880
bq.   http://review.hbase.org/r/296/diff/1/?file=2378#file2378line880
bq.  
bq.   can't this loop be over columns itself? And then inside the loop, 
you prepare one key at a time using Bytes.add(row, col). That way, you can avoid 
the keyList data structure completely.

Another good idea :) Will also get rid of the warning that keyList could 
possibly be null.


- Kris


---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review361
---





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
 Attachments: 2794_multi_column_check.txt





[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887511#action_12887511
 ] 

ryan rawson commented on HBASE-2794:


Consider a table with 12 billion rows. At 9 bits/row, we are looking
at 13,500,000,000 bytes of RAM (base) to store the blooms in RAM. That is
12.57 GB of RAM to store the blooms.  The memory competes with the block
cache, thus you are losing 12.57 GB of RAM that could be used to cache
blocks.  If your data is in block cache, seeking is free, thus there
is an essential trade-off here.
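
The arithmetic above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are invented, and "GB" is taken to mean 2^30 bytes, which is what reproduces the 12.57 figure.

```java
/** Hypothetical estimator for the RAM a bloom filter needs at a given bits-per-item cost. */
final class BloomMemoryEstimate {
    /** Total bloom size in bytes for the given number of items. */
    static long bloomBytes(long items, int bitsPerItem) {
        return Math.multiplyExact(items, (long) bitsPerItem) / 8;
    }

    /** Converts bytes to units of 2^30 bytes. */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long rows = 12_000_000_000L;        // 12 billion rows, as in the comment
        long bytes = bloomBytes(rows, 9);   // 9 bits per row
        System.out.printf("%d bytes = %.2f GiB%n", bytes, toGiB(bytes));
    }
}
```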

In my case, the 12b rows are small ones, and thus we have a lot of
rows for the actual data size.  On a different dataset, the row count
might be smaller for the actual data size and it might be
worthwhile.  Furthermore, blooms don't work on Scans, only on Gets.

The key takeaway here is that (a) bloom filters are not free and
potentially very expensive in terms of RAM, (b) bloom data competes
with the block cache, and (c) the trade off depends on the data set
and access patterns.






 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan




[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887537#action_12887537
 ] 

Nicolas Spiegelberg commented on HBASE-2794:


IRC conversation about this...

 krispyjala: nspiegelberg: but is the test you want related to HBASE-2794 or 
just using bloom filter in general (e.g. when to use it and when not to)?
 [1:41pm] nspiegelberg: it's related to 2794
 [1:42pm] nspiegelberg: an easy example of why you need good measurements is 
the case of calling bloom.contains() for 100 row+col in a 1% false positive 
bloom.  You are getting almost 100% false positives then, so the bloom is an 
obvious perf drop
 [1:43pm] krispyjala: nspiegelberg: ok i think i understand
 [1:44pm] krispyjala: nspiegelberg: but wait 100% false positive?
 [1:46pm] nspiegelberg: right, so io.hfile.bloom.error.rate == .01 by default.  
so 1%
 [1:46pm] krispyjala: ok
 [1:46pm] krispyjala: how does that add up to 100% for 100 lookups?
 [1:46pm] nspiegelberg: therefore, if you call bloom.contains() 5 times and OR 
the result, the false positive rate is 5%
 [1:49pm] nspiegelberg: krispyjala: so a simple example. call bloom.contains() 
10 times = 10% error rate = (10ms/seek * 10%) + time(bloom.contains)
 [1:50pm] krispyjala: nspiegelberg: but is it really OR'ing all of them? In the 
code if even one column lookup returns true we return true and don't look up 
any other columns
 [1:51pm] nspiegelberg: right, that's the same thing as ORing them
 [1:51pm] nspiegelberg: logical OR =  ||
 [1:52pm] krispyjala: nspiegelberg: but the point is we're probably not looking 
up 100 columns every time for that operation, even theoretically yes we do a 
logical OR
 [1:52pm] krispyjala: if we hit true on the 5th column, we quit the loop and 
return right away
 [1:53pm] nspiegelberg: the only way you win with blooms is if all 
bloom.contains() return false and you don't have to do the lookup
 [1:53pm] krispyjala: yes
 [1:53pm] nspiegelberg: so, you're right, we do an average of 50 lookups per 
false positive in this case.
 [1:54pm] nspiegelberg: I'm just saying, what is the cost of those 50 lookups?  
If 1ms, then every HFile seek costs 11ms with blooms enabled versus 10 ms 
without using them
 [1:55pm] krispyjala: but wait i thought the code was to determine whether to 
add the StoreScanner to the list or not...or are you saying then that the point 
is in the case of 100 columns we should just not even bother doing bloom 
multicolumn check because perhaps it's better to just load it than wasting time 
with the 100 lookups (potentially)
 [1:55pm] nspiegelberg: exactly
 [1:55pm] krispyjala: nspiegelberg: lol ok got it
 [1:56pm] krispyjala: but realistically, who does gets on 100 columns? I don't 
know the HBase internals well yet (that's why i picked the noob ticket 
lol)...wouldn't it be better to just do a get on the row?
 [1:57pm] nspiegelberg: never under-estimate the naivete of users
 [1:57pm] krispyjala: nspiegelberg: sigh lol, i guess that's why the bloom is 
off by default?
 [1:58pm] nspiegelberg: yes
 [1:58pm] nspiegelberg: so, it's obvious that you never want to run bloom code 
with 101 columns + 1% error rate
 [1:58pm] krispyjala: correct
 [1:59pm] nspiegelberg: so, really it's just timing testBloomPerf with various 
lookup counts on various size blooms
 [2:00pm] krispyjala: nspiegelberg: this talk has helped me think about how to 
test like you said
 [2:00pm] • St^Ack hopes the above good-stuff(tm) 'lesson' makes it back into 
the issue
 [2:00pm] nspiegelberg: looks like ryan didn't give you any concrete numbers, 
so you might have to just start with some assumptions (like, don't use blooms 
if avg key  1KB) and run with that
 [2:01pm] krispyjala: nspiegelberg: and perhaps once we kind of know where the 
tradeoff is, would it be wrong to limit in the code saying if there are more 
than say 10 column lookups might as well just return true?
 [2:01pm] krispyjala: cuz it's not worth looking up in bloom at that point
 [2:01pm] nspiegelberg: I think that's exactly what we need to do
 [2:01pm] krispyjala: whatever the threshold is
 [2:02pm] nspiegelberg: if we pretend that the cost of bloom.contains() == 0, 
then maybe we want to say if (column.count * error.rate > 10%) return true;
 [2:02pm] dj_ryan: well it's hard to say where the tradeoff goes
 [2:02pm] krispyjala: pastebin? lol jk
 [2:02pm] dj_ryan: but the hard number is 9 bits/item
 [2:03pm] dj_ryan: you can then calculate how much ram you are spending on 
blooms
 [2:03pm] dj_ryan: and decide if its worth it
 [2:03pm] nspiegelberg: the hard # for 1% error rate blooms is 9 bits/item
 [2:03pm] dj_ryan: we never implemented blooms because it seemed 12gb of ram 
would be better off caching
 [2:03pm] krispyjala: dj_ryan: so your suggestion the onus is on the user and 
not hbase code
 [2:03pm] nspiegelberg: with .1% error rate, it's ~12 bits/item
 [2:04pm] krispyjala: 
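
The cost reasoning in the log above, (seek time * false-positive rate) + time(bloom.contains), can be written down as a tiny model. This is a sketch under the log's own assumptions (a 10 ms seek and a file that holds none of the requested columns); the names are hypothetical and the 1 ms bloom cost is the example figure from the conversation, not a measurement.

```java
/** Hypothetical cost model for deciding whether a bloom check pays for itself. */
final class BloomSeekCost {
    /**
     * Expected time per HFile when the file does not contain the requested keys:
     * the bloom check always runs, and the seek happens only on a false positive.
     */
    static double expectedMsWithBloom(double seekMs, double combinedFalsePositive, double bloomCheckMs) {
        return seekMs * combinedFalsePositive + bloomCheckMs;
    }

    /** The bloom helps only if the expected time is below the cost of always seeking. */
    static boolean bloomPaysOff(double seekMs, double combinedFalsePositive, double bloomCheckMs) {
        return expectedMsWithBloom(seekMs, combinedFalsePositive, bloomCheckMs) < seekMs;
    }

    public static void main(String[] args) {
        // 10 lookups at 1% is roughly a 10% combined rate: about 2 ms expected vs 10 ms.
        System.out.println(expectedMsWithBloom(10.0, 0.10, 1.0));
        // Near 100% combined rate: about 11 ms with blooms vs 10 ms without.
        System.out.println(expectedMsWithBloom(10.0, 1.0, 1.0));
    }
}
```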

[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887638#action_12887638
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kris Jirapinyo kjirapi...@attensity.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/
---

(Updated 2010-07-12 19:48:43.373418)


Review request for hbase.


Changes
---

Implemented Kannan's suggestion, thereby removing keyList.


Summary
---

HBASE-2794 Enable bloom filter checks for multiple columns in same column family


This addresses bug HBASE-2794.
http://issues.apache.org/jira/browse/HBASE-2794


Diffs (updated)
-

  /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 
962748 
  /trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 
962748 

Diff: http://review.hbase.org/r/296/diff


Testing
---

Ran and passed org.apache.hadoop.hbase.regionserver.TestStoreFile multiple 
times.  Ran and passed all tests when building.


Thanks,

Kris




 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan




[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-12 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887659#action_12887659
 ] 

HBase Review Board commented on HBASE-2794:
---

Message from: Kannan Muthukkaruppan kan...@facebook.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review384
---


One inlined comment. Otherwise, the patch and the test look good.


/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
http://review.hbase.org/r/296/#comment1622

Once Pranav's patch for HBase-2265 lands, the shouldSeek() API will take a 
Scan as the first argument instead of the row. So, you might need to rebase 
the test with respect to that patch.


- Kannan





 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan




[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-07-11 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887274#action_12887274
 ] 

ryan rawson commented on HBASE-2794:


can you also upload it to review.hbase.org for easy reviewing, thanks :-)




 ROWCOL bloom filter not used if multiple columns within same family are 
 requested in a Get
 --

 Key: HBASE-2794
 URL: https://issues.apache.org/jira/browse/HBASE-2794
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
 Attachments: 2794_multi_column_check.txt





[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

2010-06-26 Thread Kannan Muthukkaruppan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882820#action_12882820
 ] 

Kannan Muthukkaruppan commented on HBASE-2794:
--

Perhaps a simple starter task for someone interested.
