Hi All,

I believe there is a bug in the Spark BloomFilter implementation when creating a 
bloom filter with a large n (expected number of items). Because the implementation 
uses 32-bit integer hash functions, bit positions above MAX_INT can never be set, 
so the filter does not work properly once the number of bits exceeds MAX_INT.
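
To make the failure mode concrete, here is a minimal, self-contained Java sketch 
of the arithmetic I suspect is at play. It is not Spark's actual source: the 
h1 + i * h2 double-hashing form and the hash constants are just stand-ins, and the 
class name BloomIndexDemo is made up. The point is only that once each per-item 
hash is a 32-bit int, the derived bit index can never exceed Integer.MAX_VALUE, no 
matter how large the underlying bit array is:

    public class BloomIndexDemo {
        public static void main(String[] args) {
            long numBits = 4L * Integer.MAX_VALUE; // bit array larger than MAX_INT, as with large n

            int h1 = 0x7FFFABCD;                   // stand-ins for 32-bit hash outputs
            int h2 = 0x12345678;

            long maxIndexSeen = -1;
            for (int i = 1; i <= 10; i++) {
                int combined = h1 + i * h2;        // combined hash is still a 32-bit int
                if (combined < 0) {
                    combined = ~combined;          // force non-negative; still <= Integer.MAX_VALUE
                }
                long index = combined % numBits;   // combined < numBits, so the modulo changes nothing
                maxIndexSeen = Math.max(maxIndexSeen, index);
            }

            // Every index is bounded by Integer.MAX_VALUE, so bits above that position
            // are never set and the effective filter is far smaller than requested.
            System.out.println("numBits           = " + numBits);
            System.out.println("largest index hit = " + maxIndexSeen);
            System.out.println("hard upper bound  = " + Integer.MAX_VALUE);
        }
    }

In other words, with more than MAX_INT bits allocated, every set bit still lands 
in the first 2^31 positions, the rest of the array is dead weight, and the observed 
false positive rate ends up higher than the configured one.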

I asked a question about this on Stack Overflow but didn't get a satisfactory 
answer. I believe I know what is causing the bug and have documented my reasoning 
there as well:

https://stackoverflow.com/questions/78162973/why-is-observed-false-positive-rate-in-spark-bloom-filter-higher-than-expected

I would go ahead and create a ticket on the Spark JIRA board, but I'm still 
waiting to hear back about getting my account set up.

Huge thanks if anyone can help!

-N
