[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16757


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99869104
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAliasAndProjectSuite.scala
 ---
@@ -40,8 +45,8 @@ class RemoveAliasOnlyProjectSuite extends PlanTest with 
PredicateHelper {
 
   test("all expressions in project list are aliased child output but with 
different order") {
 val relation = LocalRelation('a.int, 'b.int)
-val query = relation.select('b as 'b, 'a as 'a).analyze
-val optimized = Optimize.execute(query)
+val query = relation.select('b, 'a).analyze
+val optimized = Optimize.execute(relation.select('b as 'b, 'a as 
'a).analyze)
--- End diff --

according to most of the optimizer unit tests, the preferred code style 
should be
```
val query = relation.select('b as 'b, 'a as 'a).analyze
val optimized = Optimize.execute(query)
val expected = relation.select('b, 'a).analyze
comparePlans(optimized, expected)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99528500
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,97 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
-}
+  /**
+   * Remove redundant alias expression from a LogicalPlan and its subtree. 
A blacklist is used to
+   * prevent the removal of seemingly redundant aliases used to 
deduplicate the input for a (self)
+   * join.
+   */
+  private def removeRedundantAliases(plan: LogicalPlan, blacklist: 
AttributeSet): LogicalPlan = {
+plan match {
+  // A join has to be treated differently, because the left and the 
right side of the join are
+  // not allowed to use the same attributes. We use a blacklist to 
prevent us from creating a
+  // situation in which this happens; the rule will only remove an 
alias if its child
+  // attribute is not on the black list.
+  case Join(left, right, joinType, condition) =>
+val newLeft = removeRedundantAliases(left, blacklist ++ 
right.outputSet)
+val newRight = removeRedundantAliases(right, blacklist ++ 
newLeft.outputSet)
+val mapping = AttributeMap(
+  createAttributeMapping(left, newLeft) ++
+  createAttributeMapping(right, newRight))
+val newCondition = condition.map(_.transform {
+  case a: Attribute => mapping.getOrElse(a, a)
+})
+Join(newLeft, newRight, joinType, newCondition)
+
+  case _ =>
+// Remove redundant aliases in the subtree(s).
+val currentNextAttrPairs = mutable.Buffer.empty[(Attribute, 
Attribute)]
+val newNode = plan.mapChildren { child =>
+  val newChild = removeRedundantAliases(child, blacklist)
+  currentNextAttrPairs ++= createAttributeMapping(child, newChild)
+  newChild
+}
 
-aliasOnlyProject.map { case proj =>
-  val 

[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99528351
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,97 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
-}
+  /**
+   * Remove redundant alias expression from a LogicalPlan and its subtree. 
A blacklist is used to
+   * prevent the removal of seemingly redundant aliases used to 
deduplicate the input for a (self)
+   * join.
+   */
+  private def removeRedundantAliases(plan: LogicalPlan, blacklist: 
AttributeSet): LogicalPlan = {
+plan match {
+  // A join has to be treated differently, because the left and the 
right side of the join are
+  // not allowed to use the same attributes. We use a blacklist to 
prevent us from creating a
+  // situation in which this happens; the rule will only remove an 
alias if its child
+  // attribute is not on the black list.
+  case Join(left, right, joinType, condition) =>
+val newLeft = removeRedundantAliases(left, blacklist ++ 
right.outputSet)
+val newRight = removeRedundantAliases(right, blacklist ++ 
newLeft.outputSet)
+val mapping = AttributeMap(
+  createAttributeMapping(left, newLeft) ++
+  createAttributeMapping(right, newRight))
+val newCondition = condition.map(_.transform {
+  case a: Attribute => mapping.getOrElse(a, a)
+})
+Join(newLeft, newRight, joinType, newCondition)
+
+  case _ =>
+// Remove redundant aliases in the subtree(s).
+val currentNextAttrPairs = mutable.Buffer.empty[(Attribute, 
Attribute)]
+val newNode = plan.mapChildren { child =>
+  val newChild = removeRedundantAliases(child, blacklist)
+  currentNextAttrPairs ++= createAttributeMapping(child, newChild)
+  newChild
+}
 
-aliasOnlyProject.map { case proj =>
-  val 

[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99390416
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.
+   */
+  private def getAliasCleaner(plan: LogicalPlan): (Expression, 
AttributeSet) => Expression = {
+plan match {
+  case _: Project => removeRedundantAlias
+  case _: Aggregate => removeRedundantAlias
+  case _: Window => removeRedundantAlias
+  case _ => (e, _) => e
 }
+  }
 
-aliasOnlyProject.map { case proj =>
-  val attributesToReplace = 
proj.output.zip(proj.child.output).filterNot {
-case (a1, a2) => a1 semanticEquals a2
-  }
-  val attrMap = AttributeMap(attributesToReplace)
-  plan transform {
-case plan: Project if plan eq proj => plan.child
-case plan => plan transformExpressions {
-  case a: Attribute if attrMap.contains(a) => attrMap(a)
+  /**
+   * Remove redundant alias expression from a LogicalPlan and its subtree. 
A blacklist is used to
+   * prevent the removal of seemingly redundant aliases used to 
deduplicate the input for a (self)
+   * join.
+   */
+  private def removeRedundantAliases(plan: LogicalPlan, blacklist: 
AttributeSet): LogicalPlan = {
+plan match {
+  // A join has to be treated differently, because the left and the 
right side of the join are
+  // not allowed to use the same attributes. We use a blacklist to 
prevent us from creating a
+  // situation in which this happens; the rule will only remove an 
alias if its child
+  // attribute is not on the black list.
+  case Join(left, right, joinType, condition) =>
+val newLeft = removeRedundantAliases(left, 

[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99385357
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.
+   */
+  private def getAliasCleaner(plan: LogicalPlan): (Expression, 
AttributeSet) => Expression = {
+plan match {
+  case _: Project => removeRedundantAlias
+  case _: Aggregate => removeRedundantAlias
+  case _: Window => removeRedundantAlias
+  case _ => (e, _) => e
 }
+  }
 
-aliasOnlyProject.map { case proj =>
-  val attributesToReplace = 
proj.output.zip(proj.child.output).filterNot {
-case (a1, a2) => a1 semanticEquals a2
-  }
-  val attrMap = AttributeMap(attributesToReplace)
-  plan transform {
-case plan: Project if plan eq proj => plan.child
-case plan => plan transformExpressions {
-  case a: Attribute if attrMap.contains(a) => attrMap(a)
+  /**
+   * Remove redundant alias expression from a LogicalPlan and its subtree. 
A blacklist is used to
+   * prevent the removal of seemingly redundant aliases used to 
deduplicate the input for a (self)
+   * join.
+   */
+  private def removeRedundantAliases(plan: LogicalPlan, blacklist: 
AttributeSet): LogicalPlan = {
+plan match {
+  // A join has to be treated differently, because the left and the 
right side of the join are
+  // not allowed to use the same attributes. We use a blacklist to 
prevent us from creating a
+  // situation in which this happens; the rule will only remove an 
alias if its child
+  // attribute is not on the black list.
+  case Join(left, right, joinType, condition) =>
+val newLeft = removeRedundantAliases(left, 

[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99382442
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.
--- End diff --

Yeah that is an improvement. I added all LogicalPlan nodes that are 
producing new attributes using named expressions.

I will inline this method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99364785
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
--- End diff --

Current is plan before we remove redundant aliases, and next is the plan 
after we have remove the redundant aliases. I'll update the doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99363847
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.
+   */
+  private def getAliasCleaner(plan: LogicalPlan): (Expression, 
AttributeSet) => Expression = {
+plan match {
+  case _: Project => removeRedundantAlias
+  case _: Aggregate => removeRedundantAlias
+  case _: Window => removeRedundantAlias
+  case _ => (e, _) => e
 }
+  }
 
-aliasOnlyProject.map { case proj =>
-  val attributesToReplace = 
proj.output.zip(proj.child.output).filterNot {
-case (a1, a2) => a1 semanticEquals a2
-  }
-  val attrMap = AttributeMap(attributesToReplace)
-  plan transform {
-case plan: Project if plan eq proj => plan.child
-case plan => plan transformExpressions {
-  case a: Attribute if attrMap.contains(a) => attrMap(a)
+  /**
+   * Remove redundant alias expression from a LogicalPlan and its subtree. 
A blacklist is used to
+   * prevent the removal of seemingly redundant aliases used to 
deduplicate the input for a (self)
+   * join.
+   */
+  private def removeRedundantAliases(plan: LogicalPlan, blacklist: 
AttributeSet): LogicalPlan = {
+plan match {
+  // A join has to be treated differently, because the left and the 
right side of the join are
+  // not allowed to use the same attributes. We use a blacklist to 
prevent us from creating a
+  // situation in which this happens; the rule will only remove an 
alias if its child
+  // attribute is not on the black list.
+  case Join(left, right, joinType, condition) =>
+val newLeft = removeRedundantAliases(left, 

[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99359002
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.
+   */
+  private def getAliasCleaner(plan: LogicalPlan): (Expression, 
AttributeSet) => Expression = {
+plan match {
+  case _: Project => removeRedundantAlias
+  case _: Aggregate => removeRedundantAlias
+  case _: Window => removeRedundantAlias
+  case _ => (e, _) => e
 }
+  }
 
-aliasOnlyProject.map { case proj =>
-  val attributesToReplace = 
proj.output.zip(proj.child.output).filterNot {
-case (a1, a2) => a1 semanticEquals a2
-  }
-  val attrMap = AttributeMap(attributesToReplace)
-  plan transform {
-case plan: Project if plan eq proj => plan.child
-case plan => plan transformExpressions {
-  case a: Attribute if attrMap.contains(a) => attrMap(a)
+  /**
+   * Remove redundant alias expression from a LogicalPlan and its subtree. 
A blacklist is used to
+   * prevent the removal of seemingly redundant aliases used to 
deduplicate the input for a (self)
+   * join.
+   */
+  private def removeRedundantAliases(plan: LogicalPlan, blacklist: 
AttributeSet): LogicalPlan = {
+plan match {
+  // A join has to be treated differently, because the left and the 
right side of the join are
+  // not allowed to use the same attributes. We use a blacklist to 
prevent us from creating a
+  // situation in which this happens; the rule will only remove an 
alias if its child
+  // attribute is not on the black list.
+  case Join(left, right, joinType, condition) =>
+val newLeft = removeRedundantAliases(left, 

[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99358092
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
+  : Seq[(Attribute, Attribute)] = {
+current.output.zip(next.output).filterNot {
+  case (a1, a2) => a1.semanticEquals(a2)
 }
   }
 
-  private def stripAliasOnAttribute(projectList: Seq[NamedExpression]) = {
-projectList.map {
-  // Alias with metadata can not be stripped, or the metadata will be 
lost.
-  // If the alias name is different from attribute name, we can't 
strip it either, or we may
-  // accidentally change the output schema name of the root plan.
-  case a @ Alias(attr: Attribute, name) if a.metadata == 
Metadata.empty && name == attr.name =>
-attr
-  case other => other
-}
+  /**
+   * Remove the top-level alias from an expression when it is redundant.
+   */
+  private def removeRedundantAlias(e: Expression, blacklist: 
AttributeSet): Expression = e match {
+// Alias with metadata can not be stripped, or the metadata will be 
lost.
+// If the alias name is different from attribute name, we can't strip 
it either, or we
+// may accidentally change the output schema name of the root plan.
+case a @ Alias(attr: Attribute, name)
+  if a.metadata == Metadata.empty && name == attr.name && 
!blacklist.contains(attr) =>
+  attr
+case a => a
   }
 
-  def apply(plan: LogicalPlan): LogicalPlan = {
-val aliasOnlyProject = plan.collectFirst {
-  case p @ Project(pList, child) if isAliasOnly(pList, child.output) 
=> p
+  /**
+   * Get an appropriate alias cleaning method for the given node.
+   *
+   * We currently clean Project, Aggregate & Window nodes.
--- End diff --

so this is an improvement right? previously we only clean `Project`. 
However I think this method is over engineered, we can just create a `def 
needClean(plan: LogicalPlan): Boolean`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99357536
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
*/
-  private def isAliasOnly(
-  projectList: Seq[NamedExpression],
-  childOutput: Seq[Attribute]): Boolean = {
-if (projectList.length != childOutput.length) {
-  false
-} else {
-  stripAliasOnAttribute(projectList).zip(childOutput).forall {
-case (a: Attribute, o) if a semanticEquals o => true
-case _ => false
-  }
+  private def createAttributeMapping(current: LogicalPlan, next: 
LogicalPlan)
--- End diff --

can you explain what `current` and `next` means here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16757: [SPARK-18609][SPARK-18841][SQL] Fix redundant Ali...

2017-02-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16757#discussion_r99357432
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -154,56 +155,108 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Removes the Project only conducting Alias of its child node.
- * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
- * but can also benefit other operators.
+ * Remove redundant aliases from a query plan. A redundant alias is an 
alias that does not change
+ * the name or metadata of a column, and does not deduplicate it.
  */
-object RemoveAliasOnlyProject extends Rule[LogicalPlan] {
+object RemoveRedundantAliases extends Rule[LogicalPlan] {
+
   /**
-   * Returns true if the project list is semantically same as child 
output, after strip alias on
-   * attribute.
+   * Replace the attributes in an expression using the given mapping.
--- End diff --

looks like this doc is wrong?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org