[jira] [Commented] (SPARK-21998) SortMergeJoinExec did not calculate its outputOrdering correctly during physical planning
[ https://issues.apache.org/jira/browse/SPARK-21998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172169#comment-16172169 ]

Maryann Xue commented on SPARK-21998:
-------------------------------------

Thanks again for your comment, [~maropu]! I changed the title and description of this JIRA accordingly and created a PR as https://github.com/apache/spark/pull/19281. Could you please take a look?

> SortMergeJoinExec did not calculate its outputOrdering correctly during physical planning
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21998
>                 URL: https://issues.apache.org/jira/browse/SPARK-21998
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Maryann Xue
>            Priority: Minor
>
> Right now the calculation of SortMergeJoinExec's outputOrdering relies on the
> fact that its children have already been sorted on the join keys, while this
> is often not true until EnsureRequirements has been applied.
> {code}
>   /**
>    * For SMJ, child's output must have been sorted on key or expressions with the same order as
>    * key, so we can get ordering for key from child's output ordering.
>    */
>   private def getKeyOrdering(keys: Seq[Expression], childOutputOrdering: Seq[SortOrder])
>     : Seq[SortOrder] = {
>     keys.zip(childOutputOrdering).map { case (key, childOrder) =>
>       SortOrder(key, Ascending, childOrder.sameOrderExpressions + childOrder.child - key)
>     }
>   }
> {code}
> Thus SortMergeJoinExec's outputOrdering is most likely not correct during the
> physical planning stage, and as a result, potential physical optimizations
> that rely on the required/output orderings, like SPARK-18591, will not work
> for SortMergeJoinExec.
> The right behavior of {{getKeyOrdering(keys, childOutputOrdering)}} should be:
> 1. If the childOutputOrdering satisfies (is a superset of) the required child
> ordering => childOutputOrdering
> 2. Otherwise => required child ordering

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-21998) SortMergeJoinExec did not calculate its outputOrdering correctly during physical planning
[ https://issues.apache.org/jira/browse/SPARK-21998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172166#comment-16172166 ]

Apache Spark commented on SPARK-21998:
--------------------------------------

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/19281
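
For readers following the thread, the two-case rule proposed in the issue description for getKeyOrdering can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the pull request: Col and Order are hypothetical, simplified stand-ins for Catalyst's Expression and SortOrder, the sort direction is left out (everything is treated as ascending), and "satisfies" is assumed to mean that the child's leading sort entries cover the join keys position by position.

```scala
// Hypothetical, simplified stand-ins for Catalyst's Expression and SortOrder.
// They are NOT the Spark classes; sort direction is omitted (assumed ascending).
final case class Col(name: String)

final case class Order(child: Col, sameOrder: Set[Col] = Set.empty) {
  // True when this entry sorts on `key` directly or via an equivalent expression.
  def covers(key: Col): Boolean = child == key || sameOrder.contains(key)
}

object KeyOrderingSketch {
  // Ordering the join requires of a child: one entry per join key, in key order.
  def requiredOrdering(keys: Seq[Col]): Seq[Order] = keys.map(k => Order(k))

  // Assumed meaning of "satisfies": the child's leading entries cover the
  // join keys position by position; extra trailing entries are allowed.
  def satisfies(childOrdering: Seq[Order], keys: Seq[Col]): Boolean =
    childOrdering.length >= keys.length &&
      keys.zip(childOrdering).forall { case (key, order) => order.covers(key) }

  // Rule from the description:
  //   1. child ordering satisfies the requirement => child ordering
  //   2. otherwise                                => required ordering
  def getKeyOrdering(keys: Seq[Col], childOrdering: Seq[Order]): Seq[Order] =
    if (satisfies(childOrdering, keys)) childOrdering else requiredOrdering(keys)
}
```

With this sketch, a child that has not been sorted yet (for example, an empty ordering before EnsureRequirements runs) falls into case 2 and yields the required ordering on the join keys, instead of the truncated result that zipping the keys with the child's ordering would give.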