[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15238


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r81082573
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
+    case u: Union =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], false), false)
+    case Distinct(u: Union) =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], true), true)
     case _ => None
   }
 
   // Doing a depth-first tree traversal to combine all the union children.
   @tailrec
   private def collectUnionChildren(
       plans: mutable.Stack[LogicalPlan],
-      children: Seq[LogicalPlan]): Seq[LogicalPlan] = {
+      children: Seq[LogicalPlan],
+      dedupDistinct: Boolean): Seq[LogicalPlan] = {
--- End diff --

@hvanhovell This looks good. I will use it.





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-28 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80851860
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
+    case u: Union =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], false), false)
+    case Distinct(u: Union) =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], true), true)
     case _ => None
   }
 
   // Doing a depth-first tree traversal to combine all the union children.
   @tailrec
   private def collectUnionChildren(
       plans: mutable.Stack[LogicalPlan],
-      children: Seq[LogicalPlan]): Seq[LogicalPlan] = {
+      children: Seq[LogicalPlan],
+      dedupDistinct: Boolean): Seq[LogicalPlan] = {
--- End diff --

@hvanhovell Just for keeping the original table order in the unions.





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80851313
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SetOperationSuite.scala ---
@@ -76,4 +77,32 @@ class SetOperationSuite extends PlanTest {
         testRelation3.select('g) :: Nil).analyze
     comparePlans(unionOptimized, unionCorrectAnswer)
   }
+
+  test("no more unnecessary distincts in multiple unions") {
+    val query1 = OneRowRelation
+      .select(Literal(1).as('a))
+    val query2 = OneRowRelation
+      .select(Literal(2).as('b))
+    val query3 = OneRowRelation
+      .select(Literal(3).as('c))
+
+    val unionQuery1 = Distinct(Union(Distinct(Union(query1, query2)), query3)).analyze
+    val optimized1 = Optimize.execute(unionQuery1)
+    val distinctUnionCorrectAnswer1 =
+      Distinct(Union(query1 :: query2 :: query3 :: Nil)).analyze
+    comparePlans(distinctUnionCorrectAnswer1, optimized1)
+
+    val unionQuery2 = Union(Distinct(Union(query1, query2)), query3).analyze
+    val optimized2 = Optimize.execute(unionQuery2)
+    val distinctUnionCorrectAnswer2 =
--- End diff --

yes.





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80851344
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SetOperationSuite.scala ---
@@ -76,4 +77,32 @@ class SetOperationSuite extends PlanTest {
         testRelation3.select('g) :: Nil).analyze
     comparePlans(unionOptimized, unionCorrectAnswer)
   }
+
+  test("no more unnecessary distincts in multiple unions") {
+    val query1 = OneRowRelation
+      .select(Literal(1).as('a))
+    val query2 = OneRowRelation
+      .select(Literal(2).as('b))
+    val query3 = OneRowRelation
+      .select(Literal(3).as('c))
+
+    val unionQuery1 = Distinct(Union(Distinct(Union(query1, query2)), query3)).analyze
--- End diff --

OK.





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80802568
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
+    case u: Union =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], false), false)
+    case Distinct(u: Union) =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], true), true)
     case _ => None
   }
 
   // Doing a depth-first tree traversal to combine all the union children.
   @tailrec
   private def collectUnionChildren(
       plans: mutable.Stack[LogicalPlan],
-      children: Seq[LogicalPlan]): Seq[LogicalPlan] = {
+      children: Seq[LogicalPlan],
+      dedupDistinct: Boolean): Seq[LogicalPlan] = {
--- End diff --

Ok, so I took a longer look. The current approach uses tail recursion with 
a bunch of state. I tried rewriting this as a tail recursive function without 
that much state, but that proved to be very painful. So it seems simpler to 
just use a purely imperative function:
```scala
private def flattenUnion(union: Union, flattenDistinct: Boolean): Union = {
  val stack = mutable.Stack[LogicalPlan](union)
  val flattened = mutable.ArrayBuffer.empty[LogicalPlan]
  while (stack.nonEmpty) {
    stack.pop() match {
      case Distinct(Union(children)) if flattenDistinct =>
        stack.pushAll(children)
      case Union(children) =>
        stack.pushAll(children)
      case child =>
        flattened += child
    }
  }
  Union(flattened)
}
```
@gatorsmile why did you use `reverseMap`?
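For illustration, here is a self-contained sketch of that imperative flattening, using hypothetical stand-in case classes rather than the real Catalyst types. Note that children are pushed in reverse so that popping from the stack preserves the original left-to-right table order, which is what the `reverseMap` in the current version is for:

```scala
import scala.collection.mutable

// Stand-in plan nodes (assumptions; not the real Catalyst classes).
sealed trait LogicalPlan
case class Leaf(name: String) extends LogicalPlan
case class Union(children: Seq[LogicalPlan]) extends LogicalPlan
case class Distinct(child: LogicalPlan) extends LogicalPlan

object FlattenSketch {
  // Collapse adjacent Unions; when flattenDistinct is set, also strip
  // Distinct wrappers that sit directly on nested Unions.
  def flattenUnion(union: Union, flattenDistinct: Boolean): Union = {
    val stack = mutable.Stack[LogicalPlan](union)
    val flattened = mutable.ArrayBuffer.empty[LogicalPlan]
    while (stack.nonEmpty) {
      stack.pop() match {
        case Distinct(Union(children)) if flattenDistinct =>
          stack.pushAll(children.reverse) // reversed so pop order == original order
        case Union(children) =>
          stack.pushAll(children.reverse)
        case child =>
          flattened += child
      }
    }
    Union(flattened.toSeq)
  }
}
```

With `flattenDistinct = true`, `Union(Distinct(Union(a, b)), c)` flattens to `Union(a, b, c)`; with `false`, the inner `Distinct` is kept as an opaque child.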





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80794658
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
--- End diff --

+1

Probably simpler.





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80794597
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -579,8 +579,13 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
  * Combines all adjacent [[Union]] operators into a single [[Union]].
  */
 object CombineUnions extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case Unions(children) => Union(children)
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Unions(children, isDistinct) =>
--- End diff --

+1





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80788735
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SetOperationSuite.scala ---
@@ -76,4 +77,32 @@ class SetOperationSuite extends PlanTest {
         testRelation3.select('g) :: Nil).analyze
     comparePlans(unionOptimized, unionCorrectAnswer)
   }
+
+  test("no more unnecessary distincts in multiple unions") {
+    val query1 = OneRowRelation
+      .select(Literal(1).as('a))
+    val query2 = OneRowRelation
+      .select(Literal(2).as('b))
+    val query3 = OneRowRelation
+      .select(Literal(3).as('c))
+
+    val unionQuery1 = Distinct(Union(Distinct(Union(query1, query2)), query3)).analyze
+    val optimized1 = Optimize.execute(unionQuery1)
+    val distinctUnionCorrectAnswer1 =
+      Distinct(Union(query1 :: query2 :: query3 :: Nil)).analyze
+    comparePlans(distinctUnionCorrectAnswer1, optimized1)
+
+    val unionQuery2 = Union(Distinct(Union(query1, query2)), query3).analyze
+    val optimized2 = Optimize.execute(unionQuery2)
+    val distinctUnionCorrectAnswer2 =
+      Union(Distinct(Union(query1 :: query2 :: Nil)) :: query3 :: Nil).analyze
+    comparePlans(distinctUnionCorrectAnswer2, optimized2)
+
+    val unionQuery3 = Distinct(Union(Union(query1, query2),
+      Distinct(Union(query2, query3)))).analyze
--- End diff --

```
            query1
            |
D - U - U - query2
    |
    D - U - query2
            |
            query3
```





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80787675
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SetOperationSuite.scala ---
@@ -76,4 +77,32 @@ class SetOperationSuite extends PlanTest {
         testRelation3.select('g) :: Nil).analyze
     comparePlans(unionOptimized, unionCorrectAnswer)
   }
+
+  test("no more unnecessary distincts in multiple unions") {
+    val query1 = OneRowRelation
+      .select(Literal(1).as('a))
+    val query2 = OneRowRelation
+      .select(Literal(2).as('b))
+    val query3 = OneRowRelation
+      .select(Literal(3).as('c))
+
+    val unionQuery1 = Distinct(Union(Distinct(Union(query1, query2)), query3)).analyze
+    val optimized1 = Optimize.execute(unionQuery1)
+    val distinctUnionCorrectAnswer1 =
+      Distinct(Union(query1 :: query2 :: query3 :: Nil)).analyze
+    comparePlans(distinctUnionCorrectAnswer1, optimized1)
+
+    val unionQuery2 = Union(Distinct(Union(query1, query2)), query3).analyze
--- End diff --

```
U - D - U - query1
|       |
query3  query2
```





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80784941
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
+    case u: Union =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], false), false)
+    case Distinct(u: Union) =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], true), true)
     case _ => None
  }
 
   // Doing a depth-first tree traversal to combine all the union children.
   @tailrec
   private def collectUnionChildren(
       plans: mutable.Stack[LogicalPlan],
-      children: Seq[LogicalPlan]): Seq[LogicalPlan] = {
+      children: Seq[LogicalPlan],
+      dedupDistinct: Boolean): Seq[LogicalPlan] = {
--- End diff --

Any tail recursion can be converted to a while loop. When I wrote this, I thought tail recursion would be more welcome than a while loop. :)
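As an aside, the equivalence mentioned here can be shown on a toy example (illustrative only, not the PR's code): any tail-recursive function rewrites mechanically into a while loop over the same accumulator state.

```scala
import scala.annotation.tailrec

// Tail-recursive form: the recursive call is the last action taken.
@tailrec
def sumRec(xs: List[Int], acc: Int = 0): Int = xs match {
  case Nil    => acc
  case h :: t => sumRec(t, acc + h)
}

// The mechanical while-loop translation: parameters become mutable locals.
def sumLoop(xs: List[Int]): Int = {
  var rest = xs
  var acc = 0
  while (rest.nonEmpty) {
    acc += rest.head
    rest = rest.tail
  }
  acc
}
```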





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80753108
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
+    case u: Union =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], false), false)
+    case Distinct(u: Union) =>
+      Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan], true), true)
     case _ => None
   }
 
   // Doing a depth-first tree traversal to combine all the union children.
   @tailrec
   private def collectUnionChildren(
       plans: mutable.Stack[LogicalPlan],
-      children: Seq[LogicalPlan]): Seq[LogicalPlan] = {
+      children: Seq[LogicalPlan],
+      dedupDistinct: Boolean): Seq[LogicalPlan] = {
--- End diff --

Why is this method so complicated? It seems that we can do without the stack; a stack only makes sense if you do not want to use recursion (you can use a while loop instead). This comment has nothing to do with this PR, but with the code as it was to begin with. Could you fix it anyway?

cc @gatorsmile as well





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-27 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80750444
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -191,25 +191,34 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
 /**
  * A pattern that collects all adjacent unions and returns their children as a Seq.
+ * If the top union is wrapped in a [[Distinct]], then the [[Distinct]] in the adjacent unions, if
+ * any, will be eliminated.
  */
 object Unions {
-  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
-    case u: Union => Some(collectUnionChildren(mutable.Stack(u), Seq.empty[LogicalPlan]))
+  def unapply(plan: LogicalPlan): Option[(Seq[LogicalPlan], Boolean)] = plan match {
--- End diff --

Why do we use an extractor here? We could just move this into the rule, right?





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-26 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80593959
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SetOperationSuite.scala ---
@@ -76,4 +77,32 @@ class SetOperationSuite extends PlanTest {
         testRelation3.select('g) :: Nil).analyze
     comparePlans(unionOptimized, unionCorrectAnswer)
   }
+
+  test("no more unnecessary distincts in multiple unions") {
+    val query1 = OneRowRelation
+      .select(Literal(1).as('a))
+    val query2 = OneRowRelation
+      .select(Literal(2).as('b))
+    val query3 = OneRowRelation
+      .select(Literal(3).as('c))
+
+    val unionQuery1 = Distinct(Union(Distinct(Union(query1, query2)), query3)).analyze
+    val optimized1 = Optimize.execute(unionQuery1)
+    val distinctUnionCorrectAnswer1 =
+      Distinct(Union(query1 :: query2 :: query3 :: Nil)).analyze
+    comparePlans(distinctUnionCorrectAnswer1, optimized1)
+
+    val unionQuery2 = Union(Distinct(Union(query1, query2)), query3).analyze
+    val optimized2 = Optimize.execute(unionQuery2)
+    val distinctUnionCorrectAnswer2 =
--- End diff --

In other words, distinctUnionCorrectAnswer2 = unionQuery2, right?





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-26 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/15238#discussion_r80593053
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -579,8 +579,13 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
  * Combines all adjacent [[Union]] operators into a single [[Union]].
  */
 object CombineUnions extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case Unions(children) => Union(children)
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Unions(children, isDistinct) =>
--- End diff --

Could we handle Unions(children, true) and Unions(children, false) as separate case statements, i.e. as separate pattern matches? That would make it clearer. Like a UnionAll and a UnionDistinct?
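A rough sketch of that alternative, with hypothetical `UnionAll`/`UnionDistinct` extractor names and simplified stand-in types (not the real Catalyst classes):

```scala
// Stand-in plan nodes (assumptions; not the real Catalyst classes).
sealed trait LogicalPlan
case class Leaf(name: String) extends LogicalPlan
case class Union(children: Seq[LogicalPlan]) extends LogicalPlan
case class Distinct(child: LogicalPlan) extends LogicalPlan

// Hypothetical extractor: matches Distinct(Union(...)).
object UnionDistinct {
  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
    case Distinct(Union(children)) => Some(children)
    case _ => None
  }
}

// Hypothetical extractor: matches a bare Union.
object UnionAll {
  def unapply(plan: LogicalPlan): Option[Seq[LogicalPlan]] = plan match {
    case Union(children) => Some(children)
    case _ => None
  }
}

// The rule's match becomes two self-describing cases instead of a Boolean flag.
def describe(plan: LogicalPlan): String = plan match {
  case UnionDistinct(children) => s"UNION DISTINCT of ${children.size} children"
  case UnionAll(children)      => s"UNION ALL of ${children.size} children"
  case _                       => "not a union"
}
```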





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-25 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/15238

[SPARK-17653][SQL] Remove unnecessary distincts in multiple unions

## What changes were proposed in this pull request?

Currently, for `Union [Distinct]`, a `Distinct` operator needs to sit on top of the `Union`. When there are adjacent `Union [Distinct]`s, multiple `Distinct` operators end up in the query plan.

E.g.,

For a query like `select 1 a union select 2 b union select 3 c`:

Before this patch, its physical plan looks like:

*HashAggregate(keys=[a#13], functions=[])
+- Exchange hashpartitioning(a#13, 200)
   +- *HashAggregate(keys=[a#13], functions=[])
      +- Union
         :- *HashAggregate(keys=[a#13], functions=[])
         :  +- Exchange hashpartitioning(a#13, 200)
         :     +- *HashAggregate(keys=[a#13], functions=[])
         :        +- Union
         :           :- *Project [1 AS a#13]
         :           :  +- Scan OneRowRelation[]
         :           +- *Project [2 AS b#14]
         :              +- Scan OneRowRelation[]
         +- *Project [3 AS c#15]
            +- Scan OneRowRelation[]

Only the top `Distinct` should be necessary.

After this patch, the physical plan looks like:

*HashAggregate(keys=[a#221], functions=[], output=[a#221])
+- Exchange hashpartitioning(a#221, 5)
   +- *HashAggregate(keys=[a#221], functions=[], output=[a#221])
      +- Union
         :- *Project [1 AS a#221]
         :  +- Scan OneRowRelation[]
         :- *Project [2 AS b#222]
         :  +- Scan OneRowRelation[]
         +- *Project [3 AS c#223]
            +- Scan OneRowRelation[]
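The logical-plan rewrite behind these physical plans can be sketched with simplified stand-in types (assumptions, not Catalyst's classes): under a top-level `Distinct`, nested `Distinct`s over `Union`s are redundant, so all adjacent unions can be merged and deduplicated once at the top.

```scala
// Stand-in plan nodes (assumptions; not the real Catalyst classes).
sealed trait Plan
case class Leaf(n: Int) extends Plan
case class Union(children: Seq[Plan]) extends Plan
case class Distinct(child: Plan) extends Plan

// Under a top-level Distinct, inner Distinct(Union(...)) wrappers are
// redundant: deduplicating once at the top yields the same result.
def collapse(p: Plan): Seq[Plan] = p match {
  case Distinct(u: Union) => collapse(u)
  case Union(children)    => children.flatMap(collapse)
  case leaf               => Seq(leaf)
}

def removeRedundantDistincts(p: Plan): Plan = p match {
  case Distinct(u: Union) => Distinct(Union(collapse(u)))
  case other              => other
}
```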

## How was this patch tested?

Jenkins tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 remove-extra-distinct-union

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15238


commit c770a9a9948c301a831daa555360702c73542aa2
Author: Liang-Chi Hsieh 
Date:   2016-09-26T03:37:46Z

Remove unnecessary distincts in multiple unions.



