[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21072

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183399564

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

Yes, but I saw that `transform` actually does `transformDown`. Anyway, since that might change, it's best to use `transformDown` explicitly here.
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183392394

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

Assume the plan is

```
Sort
  Filter
    Sort
      Filter
        Sort
          OtherPlan
```

If we do `transformUp`, we hit the rule 3 times, which does some unnecessary transformations. If it's `transformDown`, it's one-shot.
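cloud-fan's hit-count argument can be sketched on a toy plan tree. Note this is a hypothetical stand-in for Catalyst's `TreeNode` API: the names `Plan`, `Sort`, `Filter`, `Leaf`, `transformDown`, and `transformUp` mimic Catalyst but are defined here for illustration only.

```scala
// Toy plan ADT and a counting rule: with transformUp the sort-removal rule
// fires at every Sort in the chain; with transformDown it fires once at the top.
object TransformOrderDemo {
  sealed trait Plan
  case class Sort(child: Plan) extends Plan
  case class Filter(child: Plan) extends Plan
  case object Leaf extends Plan

  // Strip every Sort reachable through Sort/Filter nodes, as the PR does.
  def recursiveRemoveSort(p: Plan): Plan = p match {
    case Sort(c)   => recursiveRemoveSort(c)
    case Filter(c) => Filter(recursiveRemoveSort(c))
    case other     => other
  }

  var hits = 0 // how many times the rule fired

  val removeRedundantSorts: PartialFunction[Plan, Plan] = {
    case Sort(c) =>
      hits += 1
      Sort(recursiveRemoveSort(c))
  }

  private def mapChildren(p: Plan, f: Plan => Plan): Plan = p match {
    case Sort(c)   => Sort(f(c))
    case Filter(c) => Filter(f(c))
    case other     => other
  }

  // Apply the rule at a node first, then descend into the (rewritten) children.
  def transformDown(p: Plan, rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = rule.applyOrElse(p, identity[Plan])
    mapChildren(applied, transformDown(_, rule))
  }

  // Rewrite children first, then apply the rule at the node.
  def transformUp(p: Plan, rule: PartialFunction[Plan, Plan]): Plan = {
    val childrenDone = mapChildren(p, transformUp(_, rule))
    rule.applyOrElse(childrenDone, identity[Plan])
  }
}
```

On the plan above, `transformUp` fires the rule at each of the three `Sort` nodes, while `transformDown` fires it once at the top and `recursiveRemoveSort` does the rest in one shot; both produce the same final plan.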
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183385373

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

Isn't it the same?
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183382434

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

nit: now it's more efficient to do `transformDown`
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183376362

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, child) => s.copy(child = recursiveRemoveSort(child))
+  }
+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+    case Project(fields, child) => Project(fields, recursiveRemoveSort(child))
+    case Filter(condition, child) => Filter(condition, recursiveRemoveSort(child))
--- End diff --

By the definition of `deterministic`, the entire input is part of the state of the expression. It's very likely we will get a different result if we remove a sort below a filter: e.g. `rowId() < 10` keeps the first 10 rows, and if you sort the input, the first 10 rows change. I think we should be conservative about non-deterministic expressions.
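cloud-fan's `rowId() < 10` example can be reproduced with plain Scala collections. The predicate below is a hypothetical stand-in for an order-sensitive (non-deterministic) expression; it is not a Spark API.

```scala
// A stand-in for a predicate like `rowId() < n`: it accepts the first `limit`
// rows it is fed, so WHICH rows it keeps depends on the order of its input.
object OrderSensitiveFilterDemo {
  def firstN(limit: Int): Int => Boolean = {
    var seen = 0
    _ => { seen += 1; seen <= limit }
  }

  // Filter with a fresh order-sensitive predicate.
  def keepFirstTwo(rows: Seq[Int]): Seq[Int] = rows.filter(firstN(2))
}
```

Feeding the same rows in a different order yields a different surviving set, which is exactly why the rule must not remove a `Sort` beneath a `Filter` whose condition is non-deterministic.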
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183311275

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, child) => s.copy(child = recursiveRemoveSort(child))
+  }
+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+    case Project(fields, child) => Project(fields, recursiveRemoveSort(child))
+    case Filter(condition, child) => Filter(condition, recursiveRemoveSort(child))
--- End diff --

Why do you think we should check that the filter condition and the projected items are deterministic?
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r183272597

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, child) => s.copy(child = recursiveRemoveSort(child))
+  }
+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+    case Project(fields, child) => Project(fields, recursiveRemoveSort(child))
+    case Filter(condition, child) => Filter(condition, recursiveRemoveSort(child))
--- End diff --

We should at least add `ResolvedHint`. To easily expand the whitelist in the future, I'd like to change the code style to

```
def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
  case s: Sort => recursiveRemoveSort(s.child)
  case other if canEliminateSort(other) =>
    other.withNewChildren(other.children.map(recursiveRemoveSort))
  case _ => plan
}

def canEliminateSort(plan: LogicalPlan): Boolean = plan match {
  case p: Project => p.projectList.forall(_.deterministic)
  case f: Filter => f.condition.deterministic
  case _: ResolvedHint => true
  ...
  case _ => false
}
```
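The whitelist style cloud-fan suggests can be exercised on a toy tree. This is a sketch under the assumption of a hypothetical `Plan` ADT (not Catalyst's): the point is that sorts are stripped only through nodes `canEliminateSort` approves, so an unlisted node such as a `Limit` stops the recursion and protects the sort beneath it.

```scala
// Toy whitelist-style sort removal: recursion proceeds only through approved
// node types; anything else (here, Limit) blocks it.
object WhitelistDemo {
  sealed trait Plan {
    def children: Seq[Plan]
    def withNewChildren(c: Seq[Plan]): Plan
  }
  case class Sort(child: Plan) extends Plan {
    def children = Seq(child); def withNewChildren(c: Seq[Plan]) = Sort(c.head)
  }
  case class Filter(child: Plan) extends Plan {
    def children = Seq(child); def withNewChildren(c: Seq[Plan]) = Filter(c.head)
  }
  case class Limit(child: Plan) extends Plan {
    def children = Seq(child); def withNewChildren(c: Seq[Plan]) = Limit(c.head)
  }
  case object Leaf extends Plan {
    def children = Nil; def withNewChildren(c: Seq[Plan]) = this
  }

  def canEliminateSort(plan: Plan): Boolean = plan match {
    case _: Filter => true // the real rule would also require determinism
    case _         => false // Limit and anything unknown block removal
  }

  def recursiveRemoveSort(plan: Plan): Plan = plan match {
    case Sort(child) => recursiveRemoveSort(child)
    case other if canEliminateSort(other) =>
      other.withNewChildren(other.children.map(recursiveRemoveSort))
    case _ => plan
  }
}
```

Adding a new transparent operator then means adding one `case` to `canEliminateSort`, which is the extensibility the suggested style is after.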
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181847445

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantSortsSuite.scala ---
@@ -98,4 +98,31 @@ class RemoveRedundantSortsSuite extends PlanTest {
     val correctAnswer = groupedAndResorted.analyze
     comparePlans(optimized, correctAnswer)
   }
+
+  test("remove two consecutive sorts") {
+    val orderedTwice = testRelation.orderBy('a.asc).orderBy('b.desc)
+    val optimized = Optimize.execute(orderedTwice.analyze)
+    val correctAnswer = testRelation.orderBy('b.desc).analyze
+    comparePlans(optimized, correctAnswer)
+  }
--- End diff --

Can you add a test for three consecutive sorts? Two is the base case, three will help us show the inductive case :)
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181851593

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the there is another Sort operator separated by 0...n Project/Filter operators
--- End diff --

nit: 'the there'
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181849951

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantSortsSuite.scala ---
@@ -98,4 +98,31 @@ class RemoveRedundantSortsSuite extends PlanTest {
     val correctAnswer = groupedAndResorted.analyze
     comparePlans(optimized, correctAnswer)
   }
+
--- End diff --

Could you add a test which explicitly confirms that sort.limit.sort is not simplified? I know the above two tests cover that case, but it's good to have one dedicated to testing this important property.
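The property henryr wants pinned down can be seen with plain Scala collections (a sketch, not Catalyst): the inner sort in a sort-limit-sort chain decides which rows the limit keeps, not merely their order, so it must not be removed.

```scala
// Why Sort -> Limit -> Sort must NOT be collapsed: removing the inner sort
// changes WHICH rows survive the limit, not just their ordering.
object SortLimitSortDemo {
  // Sort, keep the first n rows, then sort again (models Sort -> Limit -> Sort).
  def innerSortKept(rows: Seq[Int], n: Int): Seq[Int] =
    rows.sorted.take(n).sorted

  // Same pipeline with the inner sort "optimized away".
  def innerSortRemoved(rows: Seq[Int], n: Int): Seq[Int] =
    rows.take(n).sorted
}
```

On `Seq(3, 1, 2)` with `n = 2`, keeping the inner sort yields the two smallest values, while removing it keeps whatever two rows happened to come first, so the results differ.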
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181643988

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,15 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the next operator is a Sort itself
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)
--- End diff --

Yes, it makes sense. I will do it, thanks.
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21072#discussion_r181563918

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -736,12 +736,15 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }

 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the next operator is a Sort itself
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
       child
+    case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)
--- End diff --

Thanks for doing this! It might be useful to generalise this to any pair of sorts separated by 0 or more projections or filters. I did this for my SPARK-23975 PR, see: https://github.com/henryr/spark/commit/bb992c2058863322a9183b2985806a87729e4168#diff-a636a87d8843eeccca90140be91d4fafR322 What do you think?
[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21072

[SPARK-23973][SQL] Remove consecutive Sorts

## What changes were proposed in this pull request?

In SPARK-23375 we introduced the ability to remove a `Sort` operation during query optimization if the data is already sorted. In this follow-up we also remove a `Sort` which is followed by another `Sort`: in this case the first sort is not needed and can be safely removed.

The PR starts from @henryr's comment: https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit should be given to him.

## How was this patch tested?

Added UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23973

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21072.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21072

commit 6ba418619dcfd139b6b3fa81cf8a29d54ca62681
Author: Marco Gaido
Date: 2018-04-14T10:11:27Z

    [SPARK-23973][SQL] Remove consecutive Sorts
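The premise behind the PR can be sketched at the level of plain Scala sequences (an illustration, not Spark code): a later total sort fully determines the output order, so an earlier sort on other keys cannot be observed. The example assumes no ties on the outer sort key; when keys tie, Spark's sort leaves the relative order unspecified anyway, which is what makes the rewrite safe there too.

```scala
// Sorting by _1 and then by _2 produces the same sequence as sorting by _2
// alone: the second (outer) sort overwrites whatever order the first produced.
object ConsecutiveSortsDemo {
  val data = Seq((3, "x"), (1, "z"), (2, "y"))

  val sortedTwice: Seq[(Int, String)] = data.sortBy(_._1).sortBy(_._2)
  val sortedOnce: Seq[(Int, String)]  = data.sortBy(_._2)
}
```

Hence the inner `Sort` is pure wasted work, and the optimizer can drop it.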