Hello,

I need to filter one huge table using other huge tables.
I could not avoid a sort operation when using `WHERE IN` or `INNER JOIN`.
Can this be avoided?

As I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that Scala has a built-in DataFrame function for this (https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) but there is no corresponding function in PySpark.
Is there any way to call this within PySpark?
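Roughly what I have in mind is reaching the JVM side through the DataFrame's private `_jdf` handle; a minimal sketch, assuming that is an acceptable way to call `DataFrameStatFunctions.bloomFilter` and using the table/column names from the query below:

    # Sketch: call the Scala DataFrameStatFunctions.bloomFilter through the
    # private _jdf handle of a PySpark DataFrame (assumption on my part that
    # this is a reasonable route in 2.4.3).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # ids that carry tag 'a' (names match the example query below)
    tag_a_ids = spark.table("item_tags").filter("tag = 'a'").select("item_id")

    # Scala signature: bloomFilter(colName: String, expectedNumItems: Long, fpp: Double)
    jvm_filter = tag_a_ids._jdf.stat().bloomFilter("item_id", 1000000, 0.01)

    # jvm_filter is an org.apache.spark.util.sketch.BloomFilter living in the
    # JVM, so a membership test like this runs on the driver only
    # (assuming item_id is an integral column):
    jvm_filter.mightContain(42)

Even if this works, I'm not sure how to use the resulting JVM object inside a distributed filter on `items` from Python, which is really what I'm after.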

I'm using Spark 2.4.3.

Thanks,
Breno Arosa.

PS:
I'm not sharing the real query, but here is a very similar use case; consider that both `items` and `item_tags` are huge.

    WITH items_tag_a AS (
        SELECT id
        FROM items
        WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'a')
    ),

    items_tag_b AS (
        SELECT id
        FROM items
        WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'b')
    ),

    items_tag_c AS (
        SELECT id
        FROM items
        WHERE id IN (SELECT item_id FROM item_tags WHERE tag = 'c')
    )

    SELECT id
    FROM items_tag_a
    WHERE id IN (SELECT id FROM items_tag_b)
    AND id IN (SELECT id FROM items_tag_c)
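For reference, the same filtering written as left-semi joins in PySpark (this is the kind of plan that ends up sorting both sides, which is what I'd like to avoid; table and column names are just the ones from the query above):

    # Same filtering expressed with left-semi joins in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    items = spark.table("items")
    item_tags = spark.table("item_tags")

    def ids_with_tag(tag):
        # item ids carrying the given tag, renamed so we can join on "id"
        return (item_tags
                .filter(item_tags.tag == tag)
                .select(item_tags.item_id.alias("id")))

    result = (items.select("id")
              .join(ids_with_tag("a"), "id", "leftsemi")
              .join(ids_with_tag("b"), "id", "leftsemi")
              .join(ids_with_tag("c"), "id", "leftsemi"))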
