Re: [Discussion] How to build bloom filter in parquet

2019-03-04 Thread Jim Apple
> A bit of brainstorming (just some ideas that may or may not be useful): One more thing to consider is whether some smart encoding of the bit vector would help save space. I expect the entropy of a nearly empty or nearly full bloom filter to be relatively low, because they consist mostly
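
As a rough illustration of the entropy argument in the quoted text, the sketch below computes the Shannon entropy per bit of a bit vector in which a given fraction of bits is set. It assumes the bits are approximately independent, which is a simplification, and the function name `bit_entropy` is hypothetical rather than taken from any Parquet code.

```python
import math

def bit_entropy(fill: float) -> float:
    """Shannon entropy (in bits) of one bit that is set with probability `fill`.

    Assumption: bits in the filter are treated as independent, so this is an
    estimate of how compressible the bit vector could be under an ideal encoding.
    """
    if fill <= 0.0 or fill >= 1.0:
        return 0.0  # an all-zero or all-one vector carries no information
    return -(fill * math.log2(fill) + (1.0 - fill) * math.log2(1.0 - fill))

# Compare a nearly empty, a half-full, and a nearly full filter.
for fill in (0.05, 0.5, 0.95):
    print(f"fill={fill:.2f} entropy per bit={bit_entropy(fill):.3f}")
```

Under this assumption a half-full filter is essentially incompressible, while a filter that is 5% or 95% full needs well under a third of a bit per stored bit in the ideal case.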

Re: [Discussion] How to build bloom filter in parquet

2019-01-21 Thread 俊杰陈
Thanks Gabor and Zoltan. See my answers and replies below: Q: Which columns do we write bloom filters for? A: It would be better to let users choose for themselves; they know which columns need a column index and which columns they want a bloom filter for. We just provide the options.

Re: [Discussion] How to build bloom filter in parquet

2019-01-17 Thread Zoltan Ivanfi
Hi, I like the idea of specifying the maximum acceptable size of the bloom filter bit vector. I think it would be much better than specifying the expected number of distinct values (which we cannot expect from the API consumer in my opinion). The desired false positive probability could still
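
To make the relationship between the three quantities in this message concrete (bit vector size, number of distinct values, false positive probability), here is a minimal sketch using the textbook formulas for a classic bloom filter with an optimal number of hash functions. This is an assumption for illustration: the filter defined in the Parquet design doc may size differently, and the function names `bits_for` and `max_ndv_for` are hypothetical.

```python
import math

LN2_SQUARED = math.log(2) ** 2

def bits_for(ndv: int, fpp: float) -> int:
    """Bits a classic bloom filter needs for `ndv` distinct values at
    false positive probability `fpp` (textbook formula, optimal hash count)."""
    return math.ceil(-ndv * math.log(fpp) / LN2_SQUARED)

def max_ndv_for(max_bytes: int, fpp: float) -> int:
    """Largest distinct-value count a bit vector of `max_bytes` bytes can
    hold while keeping the false positive probability at or below `fpp`."""
    return math.floor(max_bytes * 8 * LN2_SQUARED / -math.log(fpp))

# Sizing from an expected distinct-value count versus from a size cap.
print(bits_for(1_000_000, 0.01))       # bits needed for one million values at 1%
print(max_ndv_for(1024 * 1024, 0.01))  # values that fit in a 1 MiB bit vector at 1%
```

Seen this way, a size cap plus a target probability determines how many distinct values the filter can absorb, so the API consumer does not have to supply that count up front.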

Re: [Discussion] How to build bloom filter in parquet

2019-01-17 Thread Gabor Szadovszky
Thanks for raising this, Junjie. One more topic worth adding: Which columns do we want to write bloom filters for? Might it depend on the type? Is a bloom filter required if we have a dictionary? Is a bloom filter required if the column is ordered and we have column indexes? (etc.) On Thu, Jan 17,

[Discussion] How to build bloom filter in parquet

2019-01-17 Thread 俊杰陈
Hi Parquet Developers, In the bloom filter design doc we discussed and settled on the bloom filter definition; now I'd like to invite you to discuss how to build a bloom filter in Parquet. In my current implementation, a bloom filter is created first according to a specified number of distinct