[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102336#comment-17102336
 ] 

linna shuang commented on SPARK-16951:
--------------------------------------

In TPC-H test, we met performance issue of Q16, which used NOT IN subquery and 
being translated into broadcast nested loop join. This query uses almost half 
time of total 22 queries. For example, 512GB data set, totally execution time 
is 1400 seconds, while Q16’s execution time is 630 seconds.

TPC-H is a common spark sql performance benchmark, this performance issue will 
be met usually. Do you have plan to reopen and fix this issue?

> Alternative implementation of NOT IN to Anti-join
> -------------------------------------------------
>
>                 Key: SPARK-16951
>                 URL: https://issues.apache.org/jira/browse/SPARK-16951
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>            Priority: Major
>
> A transformation currently used to process {{NOT IN}} subquery is to rewrite 
> to a form of Anti-join with null-aware property in the Logical Plan and then 
> translate to a form of {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} 
> predicate is limited to the nested-loop join execution plan, which will have 
> a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to