[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21072


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183399564
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

yes, but I saw that `transfrom` actually does `transformDown`. Anyway, I 
see that this might change and here we best have `transformDown`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183392394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

assume the plan is 
```
Sort
  Filter
Sort
  Filter
Sort
  OtherPlan
```

If we do `transformUp`, we hit the rule 3 times, which has some unnecessary 
transformation. If it's `transformDown`, it's one-shot.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183385373
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

isn't it the same?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183382434
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,29 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
--- End diff --

nit: now it's more efficient to do `transformDown`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183376362
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case Sort(orders, true, child) if 
SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
   child
+case s @ Sort(_, _, child) => s.copy(child = 
recursiveRemoveSort(child))
+  }
+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+case Project(fields, child) => Project(fields, 
recursiveRemoveSort(child))
+case Filter(condition, child) => Filter(condition, 
recursiveRemoveSort(child))
--- End diff --

by the definition of `deterministic`, the entire input is the stats of the 
expression. It's very likely we will get a different result if we remove sort 
before filter, e.g. `rowId() < 10` will get the first 10 rows, if you sort the 
input, the first 10 rows changed.

I think we should be conservative about deterministic expressions.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183311275
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case Sort(orders, true, child) if 
SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
   child
+case s @ Sort(_, _, child) => s.copy(child = 
recursiveRemoveSort(child))
+  }
+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+case Project(fields, child) => Project(fields, 
recursiveRemoveSort(child))
+case Filter(condition, child) => Filter(condition, 
recursiveRemoveSort(child))
--- End diff --

why do you think we should check for the filter condition and the projected 
items to be deterministic?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r183272597
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter 
operators
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case Sort(orders, true, child) if 
SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
   child
+case s @ Sort(_, _, child) => s.copy(child = 
recursiveRemoveSort(child))
+  }
+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+case Project(fields, child) => Project(fields, 
recursiveRemoveSort(child))
+case Filter(condition, child) => Filter(condition, 
recursiveRemoveSort(child))
--- End diff --

we should at least add `ResolvedHint`. To easily expand the white list in 
the future, I'd like to change the code style to
```
def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
  case s: Sort => recursiveRemoveSort(s.child)
  case other if canEliminateSort(other) => 
other.withNewChildren(other.children.map(recursiveRemoveSort))
  case _ => plan
}

def canEliminateSort(plan: LogicalPlan): Boolean = plan match {
  case p: Project => p.projectList.forall(_.deterministic)
  case f: Filter => f.condition.deterministic
  case _: ResolvedHint => true
  ...
  case _ => false
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-16 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r181847445
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantSortsSuite.scala
 ---
@@ -98,4 +98,31 @@ class RemoveRedundantSortsSuite extends PlanTest {
 val correctAnswer = groupedAndResorted.analyze
 comparePlans(optimized, correctAnswer)
   }
+
+  test("remove two consecutive sorts") {
+val orderedTwice = testRelation.orderBy('a.asc).orderBy('b.desc)
+val optimized = Optimize.execute(orderedTwice.analyze)
+val correctAnswer = testRelation.orderBy('b.desc).analyze
+comparePlans(optimized, correctAnswer)
+  }
--- End diff --

Can you add a test for three consecutive sorts? Two is the base case, three 
will help us show the inductive case :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-16 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r181851593
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,22 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the there is another Sort operator separated by 0...n 
Project/Filter operators
--- End diff --

nit: 'the there'


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-16 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r181849951
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantSortsSuite.scala
 ---
@@ -98,4 +98,31 @@ class RemoveRedundantSortsSuite extends PlanTest {
 val correctAnswer = groupedAndResorted.analyze
 comparePlans(optimized, correctAnswer)
   }
+
--- End diff --

Could you add a test which explicitly confirms that sort.limit.sort is not 
simplified? I know the above two tests cover that case, but it's good to have 
one dedicated to testing this important property. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-16 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r181643988
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,15 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the next operator is a Sort itself
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case Sort(orders, true, child) if 
SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
   child
+case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)
--- End diff --

yes, it makes sense. I will do, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-14 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21072#discussion_r181563918
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -736,12 +736,15 @@ object EliminateSorts extends Rule[LogicalPlan] {
 }
 
 /**
- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the next operator is a Sort itself
  */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case Sort(orders, true, child) if 
SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
   child
+case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)
--- End diff --

Thanks for doing this! It might be useful to generalise this to any pair of 
sorts separated by 0 or more projections or filters. I did this for my 
SPARK-23975 PR, see: 
https://github.com/henryr/spark/commit/bb992c2058863322a9183b2985806a87729e4168#diff-a636a87d8843eeccca90140be91d4fafR322

What do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21072: [SPARK-23973][SQL] Remove consecutive Sorts

2018-04-14 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21072

[SPARK-23973][SQL] Remove consecutive Sorts

## What changes were proposed in this pull request?

In SPARK-23375 we introduced the ability of removing `Sort` operation 
during query optimization if the data is already sorted. In this follow-up we 
remove also a `Sort` which is followed by another `Sort`: in this case the 
first sort is not needed and can be safely removed.

The PR starts from @henryr's comment: 
https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit 
should be given to him.

## How was this patch tested?

added UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23973

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21072


commit 6ba418619dcfd139b6b3fa81cf8a29d54ca62681
Author: Marco Gaido 
Date:   2018-04-14T10:11:27Z

[SPARK-23973][SQL] Remove consecutive Sorts




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org