[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513830#comment-15513830 ] Zelaine Fong commented on DRILL-4539: - Note that this was addressed in Calcite in CALCITE-1200. But because Drill is on an older Calcite fork, it doesn't have this change. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411069#comment-15411069 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti closed the pull request at: https://github.com/apache/drill/pull/462 > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243693#comment-15243693 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r59943901 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java --- @@ -22,25 +22,36 @@ import com.fasterxml.jackson.annotation.JsonCreator; import com.fasterxml.jackson.annotation.JsonProperty; import com.fasterxml.jackson.annotation.JsonTypeName; +import org.apache.drill.exec.physical.impl.join.JoinUtils.JoinComparator; @JsonTypeName("hashtable-config") public class HashTableConfig { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HashTableConfig.class); - private final int initialCapacity; private final float loadFactor; private final NamedExpression[] keyExprsBuild; private final NamedExpression[] keyExprsProbe; + private final JoinComparator[] comparators; @JsonCreator public HashTableConfig(@JsonProperty("initialCapacity") int initialCapacity, @JsonProperty("loadFactor") float loadFactor, @JsonProperty("keyExprsBuild") NamedExpression[] keyExprsBuild, - @JsonProperty("keyExprsProbe") NamedExpression[] keyExprsProbe) { + @JsonProperty("keyExprsProbe") NamedExpression[] keyExprsProbe, + @JsonProperty("comparators") JoinComparator[] comparators) { this.initialCapacity = initialCapacity; this.loadFactor = loadFactor; this.keyExprsBuild = keyExprsBuild; this.keyExprsProbe = keyExprsProbe; +if (comparators != null) { + this.comparators = comparators; +} else { + // default is nulls are equal --- End diff -- Agree, it is clean to remove passing null and explicitly pass a value to comparator. Updated patch. Also renamed join.JoinComparator to common.Comparator. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243660#comment-15243660 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-210647305 I had one comment on the latest commit but overall LGTM. +1 > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243656#comment-15243656 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r59941224 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java --- @@ -22,25 +22,36 @@ import com.fasterxml.jackson.annotation.JsonCreator; import com.fasterxml.jackson.annotation.JsonProperty; import com.fasterxml.jackson.annotation.JsonTypeName; +import org.apache.drill.exec.physical.impl.join.JoinUtils.JoinComparator; @JsonTypeName("hashtable-config") public class HashTableConfig { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HashTableConfig.class); - private final int initialCapacity; private final float loadFactor; private final NamedExpression[] keyExprsBuild; private final NamedExpression[] keyExprsProbe; + private final JoinComparator[] comparators; @JsonCreator public HashTableConfig(@JsonProperty("initialCapacity") int initialCapacity, @JsonProperty("loadFactor") float loadFactor, @JsonProperty("keyExprsBuild") NamedExpression[] keyExprsBuild, - @JsonProperty("keyExprsProbe") NamedExpression[] keyExprsProbe) { + @JsonProperty("keyExprsProbe") NamedExpression[] keyExprsProbe, + @JsonProperty("comparators") JoinComparator[] comparators) { this.initialCapacity = initialCapacity; this.loadFactor = loadFactor; this.keyExprsBuild = keyExprsBuild; this.keyExprsProbe = keyExprsProbe; +if (comparators != null) { + this.comparators = comparators; +} else { + // default is nulls are equal --- End diff -- It might seem confusing at first (it was to me) to consider the nulls-are-equal as the default case since that's not true for joins; however it is true for a group-by operation. Could you have the caller pass in explicit comparator such that the HashTableConfig does not have to decide ? The HashAggregate could instantiate the hash table with an IS_NOT_DISTINCT_FROM comparator. What do you think ? (one minor point is that the 'JoinComparator' is now being used for both Joins and Aggregates...we could call it Comparator or something generic). > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243221#comment-15243221 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-210546147 Updated PR to handle mixed EQUALS and IS NOT DISTINCT FROM in same join condition. @amansinha100 Could you please review the latest commit in the PR? > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242283#comment-15242283 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-210244225 It seems like mixed condition of EQUALS and IS NOT DISTINCT FROM in the same join condition can be supported few changes. Here are the WIP changes [1]. Need to test more and clean it up. [1] https://github.com/vkorukanti/drill/commit/0c8a4a2cef926492853e9c8ae856b82615be3c72 > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241672#comment-15241672 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-210085324 I think additional support may be needed; the hash table only stores the equality join keys, so the generated code in ChainedHashTable is targeted to the those columns. For the additional columns you will need to keep track of them (I haven't thought through the details yet.) > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241530#comment-15241530 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-210056992 I haven't tried this, but wondering whether this is sufficient to handle mix of EQUALS and IS NOT DISTINCT FROM cases. Maintain a list (one for each join condition) instead of single "areNullsEquals" flag in MergeJoinBatch [1] and ChainedHashTable [2] and generate the code accordingly for each condition. [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java#L463 [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java#L251 > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241497#comment-15241497 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-210047513 Right, the HashJoin and MergeJoin currently can handle one type of comparison, not mixed...this is why when we have a join condition such as t1.a1 = t2.a2 AND t2.b1 < t2.b2 then Drill planner currently will split it up into equality join and the non-equality conditions. The equality is handled by the join operator while a new Filter node is added after the join to process the non-equality. I had created a JIRA to be able to handle both types of comparisons in the join operator: see https://issues.apache.org/jira/browse/DRILL-3803.Do you need this support for this particular null equality join issue ? I would have thought that your rewrite rule would produce the IS NOT DISTINCT FROM condition, but not a mixed condition. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240832#comment-15240832 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-209836956 Added few more tests: ```SELECT * FROM t1, t2, t3 WHERE t1.a = t2.a OR (t1.b is null and t3.b is null)``` to test it throws CannotPlanException ```SELECT * FROM t1, t2 WHERE t1.a = t2.a OR (t1.a is null and t2.a is null)``` to test null-equality filter is pushed into join condition. Had to make some changes to ```DrillFilterJoinRules``` to handle this case. Also few more tests around 3 table join that involve null-equality condition in ```WHERE``` clause and ```JOIN ON``` 4 tests from extended regression suite failed, because now we generate a join condition that involves ```=`` and ```IS NOT DISTINCT FROM```. ```SELECT * FROM cp.`jsonInput/nullableOrdered1.json` t1 JOIN cp.`jsonInput/nullableOrdered2.json` t2 ON t1.key = t2.key WHERE t1.data IS NOT DISTINCT FROM t2.data``` It looks like HashJoin/MergeJoin operators are not handling mixed comparator in join condition. Looking at the ```ChainedHashTable``` and ```MergeJoinBatch``` code briefly, it seems like we should be able to generate code for handling mixed comparators case. @amansinha100: Let me know if I am not correct. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228891#comment-15228891 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58761862 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java --- @@ -169,4 +176,223 @@ private static boolean containIdentity(List exps, } return true; } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(RelNode, RelNode, RexNode, List, List)}. Modified to rewrite + * the null equal join condition using IS NOT DISTINCT FROM operator. + * + * Splits out the equi-join components of a join condition, and returns + * what's left. For example, given the condition + * + * L.A = R.X AND L.B = L.C AND (L.D = 5 OR L.E = + * R.Y) + * + * returns + * + * + * leftKeys = {A} + * rightKeys = {X} + * rest = L.B = L.C AND (L.D = 5 OR L.E = R.Y) + * + * + * @param left left input to join + * @param right right input to join + * @param condition join condition + * @param leftKeys The ordinals of the fields from the left input which are + * equi-join keys + * @param rightKeys The ordinals of the fields from the right input which + * are equi-join keys + * @param joinOps List of equi-join operators (EQUALS or IS NOT DISTINCT FROM) used to join the left and right keys. + * @return remaining join filters that are not equijoins; may return a + * {@link RexLiteral} true, but never null + */ + public static RexNode splitJoinCondition( + RelNode left, + RelNode right, + RexNode condition, + List leftKeys, + List rightKeys, + List joinOps) { +final List nonEquiList = new ArrayList<>(); + +splitJoinCondition( +left.getRowType().getFieldCount(), +condition, +leftKeys, +rightKeys, +joinOps, +nonEquiList); + +return RexUtil.composeConjunction( +left.getCluster().getRexBuilder(), nonEquiList, false); + } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(int, RexNode, List, List, List)}. Modified to rewrite the null + * equal join condition using IS NOT DISTINCT FROM operator. + */ + private static void splitJoinCondition( --- End diff -- Rest looks good to me. +1. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228889#comment-15228889 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58761755 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java --- @@ -169,4 +176,223 @@ private static boolean containIdentity(List exps, } return true; } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(RelNode, RelNode, RexNode, List, List)}. Modified to rewrite + * the null equal join condition using IS NOT DISTINCT FROM operator. + * + * Splits out the equi-join components of a join condition, and returns + * what's left. For example, given the condition + * + * L.A = R.X AND L.B = L.C AND (L.D = 5 OR L.E = + * R.Y) + * + * returns + * + * + * leftKeys = {A} + * rightKeys = {X} + * rest = L.B = L.C AND (L.D = 5 OR L.E = R.Y) + * + * + * @param left left input to join + * @param right right input to join + * @param condition join condition + * @param leftKeys The ordinals of the fields from the left input which are + * equi-join keys + * @param rightKeys The ordinals of the fields from the right input which + * are equi-join keys + * @param joinOps List of equi-join operators (EQUALS or IS NOT DISTINCT FROM) used to join the left and right keys. + * @return remaining join filters that are not equijoins; may return a + * {@link RexLiteral} true, but never null + */ + public static RexNode splitJoinCondition( + RelNode left, + RelNode right, + RexNode condition, + List leftKeys, + List rightKeys, + List joinOps) { +final List nonEquiList = new ArrayList<>(); + +splitJoinCondition( +left.getRowType().getFieldCount(), +condition, +leftKeys, +rightKeys, +joinOps, +nonEquiList); + +return RexUtil.composeConjunction( +left.getCluster().getRexBuilder(), nonEquiList, false); + } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(int, RexNode, List, List, List)}. Modified to rewrite the null + * equal join condition using IS NOT DISTINCT FROM operator. + */ + private static void splitJoinCondition( --- End diff -- Can you confirm if this rewrite does *not* do the conversion if the join condition happens to involve columns coming from not just 2 tables but from 3 tables ? e.g if the user accidentally gives: SELECT * FROM t1, t2, t3 WHERE t1.a = t2.a OR (t1.b is null and t3.b is null) > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228838#comment-15228838 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58757405 --- Diff: exec/java-exec/src/test/java/org/apache/drill/TestJoinNullable.java --- @@ -407,11 +342,94 @@ public void testMergeLOJNullableBothInputsOrderedDescNullsLastVsAscNullsLast() t + " ORDER BY 1 ASC NULLS LAST ) t2 " + "USING ( key )", TEST_RES_PATH, TEST_RES_PATH); -final int expectedRecordCount = 6; +testHelper(query, 6, false, true); + } + + @Test + public void withDistinctFromJoinConditionHashJoin() throws Exception { +final String query = "SELECT * FROM " + +"cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + +"cp.`jsonInput/nullableOrdered2.json` t2 " + +"ON t1.key IS NOT DISTINCT FROM t2.key AND t1.data is NOT null"; +nullEqualJoinHelper(query); + } + + @Test + public void withDistinctFromJoinConditionMergeJoin() throws Exception { +try { + test("alter session set `planner.enable_hashjoin` = false"); + final String query = "SELECT * FROM " + + "cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + + "cp.`jsonInput/nullableOrdered2.json` t2 " + + "ON t1.key IS NOT DISTINCT FROM t2.key"; + nullEqualJoinHelper(query); +} finally { + test("alter session set `planner.enable_hashjoin` = true"); +} + } + + @Test + public void withNullEqualHashJoin() throws Exception { +final String query = "SELECT * FROM " + +"cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + +"cp.`jsonInput/nullableOrdered2.json` t2 " + +"ON t1.key = t2.key OR (t1.key IS NULL AND t2.key IS NULL)"; +nullEqualJoinHelper(query); + } -enableJoin(false, true); -final int actualRecordCount = testSql(query); -assertEquals("Number of output rows", expectedRecordCount, actualRecordCount); + @Test + public void withNullEqualMergeJoin() throws Exception { +try { + test("alter session set `planner.enable_hashjoin` = false"); + final String query = "SELECT * FROM " + + "cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + + "cp.`jsonInput/nullableOrdered2.json` t2 " + + "ON t1.key = t2.key OR (t1.key IS NULL AND t2.key IS NULL)"; + nullEqualJoinHelper(query); +} finally { + test("alter session set `planner.enable_hashjoin` = true"); +} + } + + public void nullEqualJoinHelper(final String query) throws Exception { +testBuilder() +.sqlQuery(query) +.unOrdered() +.baselineColumns("key", "data", "data0", "key0") +.baselineValues(null, "L_null_1", "R_null_1", null) +.baselineValues(null, "L_null_2", "R_null_1", null) +.baselineValues("A", "L_A_1", "R_A_1", "A") +.baselineValues("A", "L_A_2", "R_A_1", "A") +.baselineValues(null, "L_null_1", "R_null_2", null) +.baselineValues(null, "L_null_2", "R_null_2", null) +.baselineValues(null, "L_null_1", "R_null_3", null) +.baselineValues(null, "L_null_2", "R_null_3", null) +.go(); } + @Test + public void withNullEqualAdditionFilter() throws Exception { --- End diff -- Sure. I will update the patch with new tests. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228833#comment-15228833 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58757060 --- Diff: exec/java-exec/src/test/java/org/apache/drill/TestJoinNullable.java --- @@ -407,11 +342,94 @@ public void testMergeLOJNullableBothInputsOrderedDescNullsLastVsAscNullsLast() t + " ORDER BY 1 ASC NULLS LAST ) t2 " + "USING ( key )", TEST_RES_PATH, TEST_RES_PATH); -final int expectedRecordCount = 6; +testHelper(query, 6, false, true); + } + + @Test + public void withDistinctFromJoinConditionHashJoin() throws Exception { +final String query = "SELECT * FROM " + +"cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + +"cp.`jsonInput/nullableOrdered2.json` t2 " + +"ON t1.key IS NOT DISTINCT FROM t2.key AND t1.data is NOT null"; +nullEqualJoinHelper(query); + } + + @Test + public void withDistinctFromJoinConditionMergeJoin() throws Exception { +try { + test("alter session set `planner.enable_hashjoin` = false"); + final String query = "SELECT * FROM " + + "cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + + "cp.`jsonInput/nullableOrdered2.json` t2 " + + "ON t1.key IS NOT DISTINCT FROM t2.key"; + nullEqualJoinHelper(query); +} finally { + test("alter session set `planner.enable_hashjoin` = true"); +} + } + + @Test + public void withNullEqualHashJoin() throws Exception { +final String query = "SELECT * FROM " + +"cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + +"cp.`jsonInput/nullableOrdered2.json` t2 " + +"ON t1.key = t2.key OR (t1.key IS NULL AND t2.key IS NULL)"; +nullEqualJoinHelper(query); + } -enableJoin(false, true); -final int actualRecordCount = testSql(query); -assertEquals("Number of output rows", expectedRecordCount, actualRecordCount); + @Test + public void withNullEqualMergeJoin() throws Exception { +try { + test("alter session set `planner.enable_hashjoin` = false"); + final String query = "SELECT * FROM " + + "cp.`jsonInput/nullableOrdered1.json` t1 JOIN " + + "cp.`jsonInput/nullableOrdered2.json` t2 " + + "ON t1.key = t2.key OR (t1.key IS NULL AND t2.key IS NULL)"; + nullEqualJoinHelper(query); +} finally { + test("alter session set `planner.enable_hashjoin` = true"); +} + } + + public void nullEqualJoinHelper(final String query) throws Exception { +testBuilder() +.sqlQuery(query) +.unOrdered() +.baselineColumns("key", "data", "data0", "key0") +.baselineValues(null, "L_null_1", "R_null_1", null) +.baselineValues(null, "L_null_2", "R_null_1", null) +.baselineValues("A", "L_A_1", "R_A_1", "A") +.baselineValues("A", "L_A_2", "R_A_1", "A") +.baselineValues(null, "L_null_1", "R_null_2", null) +.baselineValues(null, "L_null_2", "R_null_2", null) +.baselineValues(null, "L_null_1", "R_null_3", null) +.baselineValues(null, "L_null_2", "R_null_3", null) +.go(); } + @Test + public void withNullEqualAdditionFilter() throws Exception { --- End diff -- Could you also do similar test with the join condition in the WHERE clause instead of ON clause ? i.e something like: SELECT * FROM t1, t2 WHERE t1.a = t2.a OR (t1.a is null and t2.a is null) For such cases, Calcite filter pushdown into join needs to be applied first. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, >
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228757#comment-15228757 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58751970 --- Diff: exec/java-exec/src/main/codegen/templates/ComparisonFunctions.java --- @@ -215,6 +192,36 @@ public void eval() { } } + <#-- IS_DISTINCT_FROM function --> + @FunctionTemplate(names = {"is_distinct_from", "is distinct from" }, --- End diff -- I added tests for each category of template code path (primitive type, decimal type and interval type) in TestIsDistinctFromFunctions.java > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228746#comment-15228746 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58751332 --- Diff: exec/java-exec/src/main/codegen/templates/ComparisonFunctions.java --- @@ -215,6 +192,36 @@ public void eval() { } } + <#-- IS_DISTINCT_FROM function --> + @FunctionTemplate(names = {"is_distinct_from", "is distinct from" }, --- End diff -- I am not opposed to having a native implementation of IS [NOT] DISTINCT FROM...clearly the generated code is more compact; however adding this new functions means we would need proper functional test coverage for various data types. Any thoughts regarding that ? > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228651#comment-15228651 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58742916 --- Diff: exec/java-exec/src/main/codegen/templates/ComparisonFunctions.java --- @@ -215,6 +192,36 @@ public void eval() { } } + <#-- IS_DISTINCT_FROM function --> + @FunctionTemplate(names = {"is_distinct_from", "is distinct from" }, --- End diff -- If adding new functions is a concern, I can make the ```RelOptUtil#splitJoinCondition``` to identify rewritten ```IS NOT DISTINCT FROM``` functions also. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228646#comment-15228646 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58742546 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java --- @@ -169,4 +176,223 @@ private static boolean containIdentity(List exps, } return true; } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(RelNode, RelNode, RexNode, List, List)}. Modified to rewrite --- End diff -- I will followup with a JIRA on Calcite project to see if we can push this change to Calcite. The function ```RelOptUtil#splitJoinCondition``` in the current form itself seems to have a problem/limitation. Currently it just returns the left and right join key indices, but doesn't return whether the condition is ```EQUAL``` or ```IS NOT DISTINCT FROM``` (it adds the key pair if they have either of these function in comparison). > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228628#comment-15228628 ] ASF GitHub Bot commented on DRILL-4539: --- Github user vkorukanti commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58741416 --- Diff: exec/java-exec/src/main/codegen/templates/ComparisonFunctions.java --- @@ -215,6 +192,36 @@ public void eval() { } } + <#-- IS_DISTINCT_FROM function --> + @FunctionTemplate(names = {"is_distinct_from", "is distinct from" }, --- End diff -- I am not sure if there is way to differentiate between the function in join condition vs. function in project expr. I don't see any context info in DrillConvertletTable.get() method call. Also the generated code in rewritten case is too much. For following query: ```SELECT INT_col is not distinct from BIGINT_col as col, int_distinct_result FROM cp.`functions/distinct_from.json``` Without rewrite: https://gist.github.com/vkorukanti/e981058f985ed24e6c4ef6b47d670e0f With rewrite: https://gist.github.com/vkorukanti/d80aa2ba40c65c9215c38ed18b20a685 Sizes may differ after scalar replacement is done, but it is still too much code to simple ```is not distinct from``` function. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228503#comment-15228503 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58730151 --- Diff: exec/java-exec/src/main/codegen/templates/ComparisonFunctions.java --- @@ -215,6 +192,36 @@ public void eval() { } } + <#-- IS_DISTINCT_FROM function --> + @FunctionTemplate(names = {"is_distinct_from", "is distinct from" }, --- End diff -- @vkorukanti, I want to clarify...if the query only had a join condition with IS_NOT_DISTINCT_FROM, I would think it should work just with your convertlet changes, since both HashJoin and MergeJoin handle this type of join condition. Is the reason you had to implement the full comparator codegen to handle more general types of comparisons ? e.g in the SELECT list if I say 'SELECT a IS NOT DISTINCT FROM b' ?Suppose we had a convertlet that only preserved the IS (NOT) DISTINCT FROM join condition, and defaulted to the Calcite rewrite using the CASE expression, then we would not have to implement the full comparator. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228484#comment-15228484 ] ASF GitHub Bot commented on DRILL-4539: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58728331 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java --- @@ -169,4 +176,223 @@ private static boolean containIdentity(List exps, } return true; } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(RelNode, RelNode, RexNode, List, List)}. Modified to rewrite --- End diff -- Agree that we ideally should leverage the Calcite code..especially since this method is pretty heavily used and modified periodically so keeping Drill's version of this method in sync will be difficult. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227610#comment-15227610 ] ASF GitHub Bot commented on DRILL-4539: --- Github user jacques-n commented on a diff in the pull request: https://github.com/apache/drill/pull/462#discussion_r58645138 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java --- @@ -169,4 +176,223 @@ private static boolean containIdentity(List exps, } return true; } + + /** + * Copied from {@link RelOptUtil#splitJoinCondition(RelNode, RelNode, RexNode, List, List)}. Modified to rewrite --- End diff -- Is there a way to make this not copied? @julianhyde, any ideas for how to refactor common stuff out? > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227606#comment-15227606 ] ASF GitHub Bot commented on DRILL-4539: --- Github user jacques-n commented on the pull request: https://github.com/apache/drill/pull/462#issuecomment-206092089 @amansinha100, can you take a look? > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226910#comment-15226910 ] Venki Korukanti commented on DRILL-4539: RelOptUtil.splitJoinCondition() is copied from Calcite and modified to: 1) All null equality conditions 2) Also output the operator (EQUALS, IS NOT DISTINCT FROM) to use between left and right keys. We need this info to reconstruct the join condition. If the changes look ok, I will check with Calcite community to see if we can push this change to Calcite. Seems like a generic case. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226892#comment-15226892 ] ASF GitHub Bot commented on DRILL-4539: --- GitHub user vkorukanti opened a pull request: https://github.com/apache/drill/pull/462 DRILL-4539: Add support for Null Equality Joins @amansinha100 @jacques-n Could you please review the patches? You can merge this pull request into a Git repository by running: $ git pull https://github.com/vkorukanti/drill null_rewrite Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/462.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #462 commit 3a1db78e24fe0b956659365a1267d9aee79a2955 Author: vkorukanti Date: 2016-04-01T23:44:24Z DRILL-4539 (part 1): Use custom convertlet table for IS [NOT] DISTINCT FROM functions Test changes: + Add join (hash and merge) tests that have IS NOT DISTINCT FROM in the join condition + Refactor/fix tests in TestJoinNullable class to reset options properly at the end of the test. commit d9a149007e58c4219b7d0d1d9647558e27a401b8 Author: vkorukanti Date: 2016-04-04T23:25:10Z DRILL-4539 (part 2): Implement IS_DISTINCT_FROM and IS_NOT_DISTINCT_FROM functions commit af40ee7095d02ec10bf500c24f04433e02ccb007 Author: vkorukanti Date: 2016-04-05T18:19:07Z DRILL-4539 (part 3): Rewrite null equal join condition as IS NOT DISTINCT FROM > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224702#comment-15224702 ] Venki Korukanti commented on DRILL-4539: The patch I posted is not complete yet. Currently Drill doesn't have {{IS_DISTINCT_FROM}} or {{IS_NOT_DISTINCT_FROM}} implementations. If we don't let Calcite rewrite these functions anymore, then we need to provided implementations in Drill. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224693#comment-15224693 ] Venki Korukanti commented on DRILL-4539: When the query {{IS_NOT_DISTINCT_FROM}} function, it is rewritten in SQL-to-Rel conversion phase. In DRILL-2092, AggregateExpandDistinctAggregatesRule expands aggregates as {{IS_NOT_DISTINCT_FROM}} function. Once rewritten, it doesn't go through the convertlet table conversion. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222726#comment-15222726 ] Aman Sinha commented on DRILL-4539: --- Interesting that Calcite's default behavior is to do the conversion of IS_NOT_DISTINCT_FROM to CASE if it was specified in the original query considering that Calcite itself internally generates the HashJoin plan with IS_NOT_DISTINCT_FROM join condition (see DRILL-2092 for an example). > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222555#comment-15222555 ] Venki Korukanti commented on DRILL-4539: Calcite's StandardConvertletTable is converting the {{IS_NOT_DISTINCT_FROM}} function into a {{CASE}} expression. Overriding it with a custom convertlet table, makes the query execute (query that has {{IS_NOT_DISTINCT_FROM}}) for both merge join and hash join. Changes are [here|https://github.com/vkorukanti/drill/commits/DRILL-4539]. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212645#comment-15212645 ] Aman Sinha commented on DRILL-4539: --- Actually, it looks like the 'condition' is TRUE when checkCartesianJoin() is run, so the join filter pushdown never happened from above the join. I just realized the plan does not even have the IS NOT DISTINCT FROM filter...it is getting converted to the more complex condition involving CASE statement which did not get pushed to the join. You may want to check why the original filter got changed.. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212566#comment-15212566 ] Aman Sinha commented on DRILL-4539: --- That's odd..we should not need to do NestedLoop join for this. Both HashJoin and MergeJoin support the 'is not distinct from' join condition. For an example of hashjoin where this gets used, see DRILL-2092. I think the JoinUtils.checkCartesianJoin() is incorrectly reporting that this is a non-equality join. Some changes may be needed there based on output of RelOptUtil.splitJoinCondition(). > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212256#comment-15212256 ] Jacques Nadeau commented on DRILL-4539: --- Quick note, that rewrite is not sufficient to correctly plan the query without cartesian join. To get it to run, we still have to do: {code} SET `planner.enable_nljoin_for_scalar_only` = false; {code} {code} explain plan for select t1.id, t1.name, t2.id, t2.name from dfs.`/opt/data/example.json` t1 join dfs.`/opt/data/example.json` t2 on t1.id is not distinct from t2.id {code} {code} 00-00Screen 00-01 Project(id=[$0], name=[$1], id0=[$2], name0=[$3]) 00-02Project(id=[$0], name=[$1], id0=[$2], name0=[$3]) 00-03 SelectionVectorRemover 00-04Filter(condition=[CAST(CASE(IS NULL($0), IS NULL($2), IS NULL($2), IS NULL($0), =($0, $2))):BOOLEAN NOT NULL]) 00-05 NestedLoopJoin(condition=[true], joinType=[inner]) 00-07Project(id=[$1], name=[$0]) 00-09 Scan(groupscan=[EasyGroupScan [selectionRoot=file:/opt/data/example.json, numFiles=1, columns=[`id`, `name`], files=[file:/opt/data/example.json]]]) 00-06Project(id0=[$0], name0=[$1]) 00-08 Project(id=[$1], name=[$0]) 00-10Scan(groupscan=[EasyGroupScan [selectionRoot=file:/opt/data/example.json, numFiles=1, columns=[`id`, `name`], files=[file:/opt/data/example.json]]]) {code} > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212245#comment-15212245 ] Zelaine Fong commented on DRILL-4539: - Just for clarity, the proposed rewritten syntax using IS NOT DISTINCT FROM would be: Original Query: select ... FROM t1, t2 WHERE t1.c1 = t2.c2 OR (t1.c1 IS NULL AND t2.c2 IS NULL) Rewritten Query: select ... FROM t1, t2 WHERE t1.c1 IS NOT DISTINCT FROM t2.c2 > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211156#comment-15211156 ] Aman Sinha commented on DRILL-4539: --- The execution side is already available in Drill .. we added this a while ago (may have been pre 1.0). > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211052#comment-15211052 ] Jacques Nadeau commented on DRILL-4539: --- Thanks for these details. Does this mean the execution is already available or is my initial description still accurate? > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211045#comment-15211045 ] Aman Sinha commented on DRILL-4539: --- See here for the Drill support: https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/JoinUtils.java#L74 > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4539) Add support for Null Equality Joins
[ https://issues.apache.org/jira/browse/DRILL-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211041#comment-15211041 ] Aman Sinha commented on DRILL-4539: --- Yes, we should do this optimization. We should leverage the IS NOT DISTINCT FROM comparison which is SQL standard and supported by Calcite as well as Drill. This allows nulls to be compared as equal. > Add support for Null Equality Joins > --- > > Key: DRILL-4539 > URL: https://issues.apache.org/jira/browse/DRILL-4539 > Project: Apache Drill > Issue Type: Improvement >Reporter: Jacques Nadeau >Assignee: Venki Korukanti > > Tableau frequently generates queries similar to this: > {code} > SELECT `t0`.`city` AS `city`, > `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`, > `t0`.`state` AS `state`, > `t0`.`sum_stars_ok` AS `sum_stars_ok` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > SUM(`business`.`stars`) AS `sum_stars_ok` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state` > ) `t0` > INNER JOIN ( > SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`, > `t1`.`city` AS `city`, > `t1`.`state` AS `state` > FROM ( > SELECT `business`.`city` AS `city`, > `business`.`state` AS `state`, > `business`.`business_id` AS `business_id`, > SUM(`business`.`stars`) AS `X_measure__A` > FROM `mongo.academic`.`business` `business` > GROUP BY `business`.`city`, > `business`.`state`, > `business`.`business_id` > ) `t1` > GROUP BY `t1`.`city`, > `t1`.`state` > ) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND > (`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` > IS NULL) AND (`t2`.`state` IS NULL > {code} > If you look at the join condition, you'll note that the join condition is an > equality condition which also allows null=null. We should add a planning > rewrite rule and execution join option to allow null equality so that we > don't treat this as a cartesian join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)