Hi

To get optimal performance from a bloom filter, make sure the records are 
sorted on col1. Sorting on the column of interest lets the ORC reader prune 
stripes and row groups efficiently. If the records you are searching for are 
spread across row groups (10K rows by default) or stripes (64MB by default), 
the ORC reader will have to read all or most of the row groups and stripes. 
Sorting clusters matching records together, which makes pruning more effective.
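
As a rough sketch of the above, one way to get sorted data is to rewrite the 
table through a SORT BY. The table name test_sorted is hypothetical; test, col1 
and col2 come from the original question. This assumes a Hive release whose ORC 
writer honors orc.bloom.filter.columns, and that predicate pushdown into the 
ORC reader is controlled by hive.optimize.index.filter, which may be off by 
default depending on your version.

```sql
-- Hypothetical sorted copy of the table from the question.
CREATE TABLE test_sorted
(
  col1   STRING,
  col2   STRING
) STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="col1");

-- Rewrite the data so that rows with equal or nearby col1 values
-- end up in the same row groups and stripes within each output file.
INSERT OVERWRITE TABLE test_sorted
SELECT col1, col2
FROM test
SORT BY col1;

-- Assumption: this setting enables pushing the filter down to the ORC reader,
-- which is what allows row groups to be skipped at read time.
SET hive.optimize.index.filter=true;

SELECT * FROM test_sorted WHERE col1 = '1234';
```

SORT BY orders rows within each reducer, so every output file is sorted on its 
own. That should be enough for pruning inside a file, since the min/max 
statistics and bloom filters are kept per row group and stripe.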

Thanks
Prasanth

On Jan 28, 2016, at 6:46 PM, Frank Luo <j...@merkleinc.com> wrote:

All,

I have a huge table from which I periodically want to select rows with a 
particular value. For example, suppose I have a table of the entire world 
population. I learn that the person with id "1234" is a criminal, so I want to 
pull his information out of the table.

Without any optimization, I have to use thousands of mappers to find just one 
id, which is not ideal. I tried enabling a bloom filter on the column I want to 
search on, but a simple query shows that the amount of data read is the same as 
without a bloom filter. So I am wondering whether bloom filters are supported 
in the version I am on, which is 0.14. Does anyone know? If a bloom filter is 
not the way to go, does anyone have suggestions?

Here is the HQL:

create table test
(
  col1   STRING,
  col2   STRING
) STORED AS ORC
tblproperties ("orc.bloom.filter.columns"="col1");

select * from test where col1 = '1234';

Thx

Frank