Song Jun created SPARK-27229:
--------------------------------

             Summary: GroupBy Placement in Intersect Distinct
                 Key: SPARK-27229
                 URL: https://issues.apache.org/jira/browse/SPARK-27229
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Song Jun

The Optimizer replaces the Intersect operator with a Left Semi Join. For example:

    SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    ==>
    SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the data before the join by placing a GROUP BY operator under each side of the join, that is:

    ==>
    SELECT a1, a2 FROM
      (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
      LEFT SEMI JOIN
      (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
      ON X.a1<=>Y.b1 AND X.a2<=>Y.b2

The join then runs on smaller inputs, because GROUP BY has already removed the duplicate rows on each side. The saving is largest when the tables contain many duplicates.
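
As a rough sanity check of the rewrite proposed above, the sketch below runs the three query shapes against a small in-memory SQLite database and compares the results. It is only an illustration under stated assumptions: SQLite has neither LEFT SEMI JOIN nor the <=> operator, so EXISTS stands in for the semi join and IS stands in for null-safe equality, and the sample rows are made up. It says nothing about Spark's actual plans or performance.

```python
import sqlite3

# In-memory database with made-up sample rows (duplicates and NULLs included).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tab1 (a1 INTEGER, a2 INTEGER);
    CREATE TABLE Tab2 (b1 INTEGER, b2 INTEGER);
    INSERT INTO Tab1 VALUES (1, 10), (1, 10), (2, 20), (NULL, 30), (NULL, 30), (4, 40);
    INSERT INTO Tab2 VALUES (1, 10), (1, 10), (1, 10), (NULL, 30), (5, 50);
""")

# Original query.
INTERSECT_SQL = "SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2"

# Stand-in for: SELECT DISTINCT ... LEFT SEMI JOIN ... ON a1<=>b1 AND a2<=>b2
# (assumption: EXISTS emulates the semi join, IS emulates <=>).
SEMI_JOIN_SQL = """
    SELECT DISTINCT a1, a2 FROM Tab1
    WHERE EXISTS (SELECT 1 FROM Tab2 WHERE a1 IS b1 AND a2 IS b2)
"""

# Stand-in for the proposed form: GROUP BY on both sides before the semi join.
PUSHED_SQL = """
    SELECT a1, a2 FROM (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
    WHERE EXISTS (
        SELECT 1 FROM (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
        WHERE X.a1 IS Y.b1 AND X.a2 IS Y.b2
    )
"""

def fetch(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()

def count(sql):
    """Return the number of rows a query produces."""
    return len(fetch(sql))

# Rows entering the join, before and after the GROUP BY on each side.
tab1_rows = count("SELECT a1, a2 FROM Tab1")
tab1_grouped = count("SELECT a1, a2 FROM Tab1 GROUP BY a1, a2")
tab2_rows = count("SELECT b1, b2 FROM Tab2")
tab2_grouped = count("SELECT b1, b2 FROM Tab2 GROUP BY b1, b2")

if __name__ == "__main__":
    print("intersect:", set(fetch(INTERSECT_SQL)))
    print("semi join:", set(fetch(SEMI_JOIN_SQL)))
    print("pushed   :", set(fetch(PUSHED_SQL)))
    print("Tab1 rows into join:", tab1_rows, "->", tab1_grouped)
    print("Tab2 rows into join:", tab2_rows, "->", tab2_grouped)
```

On this sample data all three queries should return the same set of rows, while the grouped subqueries feed fewer rows into the join than the raw tables do.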