Hi All, I believe there is a bug in the Spark BloomFilter implementation when creating a filter with a large expected number of items (n). Because the implementation derives bit positions from 32-bit integer hashes, it stops working properly once the number of bits exceeds Integer.MAX_VALUE: bit indices above that bound can never be produced, so only part of the bit array is ever used and the observed false positive rate is much higher than the configured one.
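To make the effect concrete, here is a minimal Python sketch of the bit-index computation as I read it from BloomFilterImpl (the combinedHash arithmetic below is my reading of the Java source, not a verified copy): the combined hash is computed in 32-bit int arithmetic, so even when the bit array is longer than Integer.MAX_VALUE, the index `combinedHash % bitSize` can never reach the upper part of the array.

```python
import random

INT_MAX = 2**31 - 1

def to_int32(x):
    """Wrap an integer to Java 32-bit signed int semantics."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

def bit_index(h1, h2, i, num_bits):
    # Mirrors (my reading of) Spark's scheme: combinedHash = h1 + i * h2,
    # computed as a 32-bit int, sign-flipped if negative, then modded by
    # num_bits. Since combinedHash <= INT_MAX, indices above INT_MAX are
    # unreachable whenever num_bits > INT_MAX.
    combined = to_int32(h1 + i * h2)
    if combined < 0:
        combined = to_int32(~combined)
    return combined % num_bits

# A bit array larger than Integer.MAX_VALUE (hypothetical sizing for the demo).
num_bits = 8 * 2**31
random.seed(0)
max_index = 0
for _ in range(100_000):
    # Stand-ins for the two 32-bit Murmur3 hashes of an item.
    h1 = random.randint(-2**31, INT_MAX)
    h2 = random.randint(-2**31, INT_MAX)
    for i in range(1, 6):  # numHashFunctions = 5 in this sketch
        max_index = max(max_index, bit_index(h1, h2, i, num_bits))

print(max_index <= INT_MAX)  # True: the top 15/16 of this array is never touched
```

If this reading is right, the effective filter size is capped at roughly 2^31 bits regardless of the requested size, which would explain the inflated false positive rate for large n.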
I asked a question about this on Stack Overflow but didn't get a satisfactory answer. I believe I know what is causing the bug and have documented my reasoning there as well: https://stackoverflow.com/questions/78162973/why-is-observed-false-positive-rate-in-spark-bloom-filter-higher-than-expected

I would go ahead and create a ticket on the Spark Jira board, but I'm still waiting to hear back about getting my account set up. Huge thanks if anyone can help!

-N