Hello,

I need to filter one huge table using other huge tables.
I could not avoid a sort operation when using `WHERE IN` or `INNER JOIN`.
Can this be avoided?

As I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that Scala has a built-in DataFrame function for this (https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) but there is no corresponding function in PySpark.
Is there any way to call this within PySpark?
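Roughly what I have in mind is reaching the JVM side through the DataFrame's private `_jdf` handle; a minimal sketch, assuming that is an acceptable way to call `DataFrameStatFunctions.bloomFilter` and using the table/column names from the query below:

    # Sketch: call the Scala DataFrameStatFunctions.bloomFilter through the
    # private _jdf handle of a PySpark DataFrame (assumption on my part that
    # this is a reasonable route in 2.4.3).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # ids that carry tag 'a' (names match the example query below)
    tag_a_ids = spark.table("item_tags").filter("tag = 'a'").select("item_id")

    # Scala signature: bloomFilter(colName: String, expectedNumItems: Long, fpp: Double)
    jvm_filter = tag_a_ids._jdf.stat().bloomFilter("item_id", 1000000, 0.01)

    # jvm_filter is an org.apache.spark.util.sketch.BloomFilter living in the
    # JVM, so a membership test like this runs on the driver only
    # (assuming item_id is an integral column):
    jvm_filter.mightContain(42)

Even if this works, I'm not sure how to use the resulting JVM object inside a distributed filter on `items` from Python, which is really what I'm after.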

I'm using Spark 2.4.3.

Thanks,
Breno Arosa.

PS:
I'm not sharing the real query, but here is a very similar use case; consider that both `items` and `item_tags` are huge.

    WITH items_tag_a AS (
        SELECT id
        FROM items
        WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'a')
    ),

    items_tag_b AS (
        SELECT id
        FROM items
        WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'b')
    ),

    items_tag_c AS (
        SELECT id
        FROM items
        WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'c')
    )

    SELECT id
    FROM items_tag_a
    WHERE id IN (SELECT id FROM items_tag_b)
    AND id IN (SELECT id FROM items_tag_c)
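For reference, the same filtering written as left-semi joins in PySpark (this is the kind of plan that ends up sorting both sides, which is what I'd like to avoid; table and column names are just the ones from the query above):

    # Same filtering expressed with left-semi joins in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    items = spark.table("items")
    item_tags = spark.table("item_tags")

    def ids_with_tag(tag):
        # item ids carrying the given tag, renamed so we can join on "id"
        return (item_tags
                .filter(item_tags.tag == tag)
                .select(item_tags.item_id.alias("id")))

    result = (items.select("id")
              .join(ids_with_tag("a"), "id", "leftsemi")
              .join(ids_with_tag("b"), "id", "leftsemi")
              .join(ids_with_tag("c"), "id", "leftsemi"))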
