Florentino Sainz created SPARK-31417:
----------------------------------------

             Summary: Allow broadcast variables when using isin code
                 Key: SPARK-31417
                 URL: https://issues.apache.org/jira/browse/SPARK-31417
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: Florentino Sainz


*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*


I'm coming from 
https://stackoverflow.com/questions/61111172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any clue If it's possible 
to extend current case class "In" while keeping the compatibility with PushDown 
implementations, but not having the List as a val in the case class?* This is 
what makes the tasks huge, I made a test-run with a predicate which receives 
and Broadcast variable and uses the value inside, and it works much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to