[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-02-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16255


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-01-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r96110941
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+&& b.child.isInstanceOf[NamedExpression] => b.child
--- End diff --

I know how the failure happens and this can fix it, but I think it's too 
hacky and does not catch the root cause.  
https://github.com/apache/spark/pull/16255/files#r92348878 easily explains why 
the failure happens and how to fix it, can you make other people understand 
your fix easily?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-01-13 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r95981143
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+&& b.child.isInstanceOf[NamedExpression] => b.child
--- End diff --

RemoveAliasOnlyProject will remove  `alias a#1 to a#2`, and `replace all 
a#2 with a#1`, so there is no `a#2` exists, If we do nothing for `alias a#1 to 
a#2`(not the same object with the removed one), it will cause the exception 
situation from `step 5` to `step 6` showed on the above comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-01-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r95300466
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+&& b.child.isInstanceOf[NamedExpression] => b.child
--- End diff --

I don't get it, for `alias a#1 to a#2`, we wanna `replace all a#2 with 
a#1`, so we will do nothing for `alias a#1 to a#2`, because we can't find an 
attribute `a#2`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-01-08 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r95095416
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+&& b.child.isInstanceOf[NamedExpression] => b.child
--- End diff --

As you said, if we find an alias-only project, e.g. alias a#1 to a#2, it's 
safe to `remove this project` and `replace all a#2 with a#1` in this plan. So 
another Alias which is also alias a#1 to a#2, but not the same object with the 
first one, it will not be processed. 

here, the logic shows that we process the situation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-01-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r94361717
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+&& b.child.isInstanceOf[NamedExpression] => b.child
--- End diff --

How do you reason about this? why we treat `Alias` differently here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2017-01-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r94361709
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
+&& b.child.isInstanceOf[NamedExpression] => b.child
--- End diff --

How do you reason about this? why we treat `Alias` differently here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-27 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r93892249
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
--- End diff --

sorry, describe it clearly,this is not safe? 
https://github.com/windpiger/spark/blob/0413f9dad4ad1294e3400dc0f42f66529b1b055b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L203-L204



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r93841751
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
--- End diff --

what do you mean?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-14 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92359444
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
--- End diff --

it is safe to only replace attributes in the ancestor nodes.
Alias with the same exprId but not the same object, replace the alias with 
it's child. it is not safe ,right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92349611
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
--- End diff --

a naive way to do this is, make sure we only replace attributes in the 
ancestor nodes of the alias-only project:
```
plan transform {
  case plan: Project if plan eq proj => plan.child
  case plan if plan.collect { case p if p eq project }.nonEmpty => // do 
the replace
}
```

It's very inefficient, maybe we can improve `TreeNode` to maintain the 
parent-child relationship between nodes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92348878
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
--- End diff --

The problem is: this rule assumes that, if we find an alias-only project, 
e.g. alias a#1 to a#2, it's safe to remove this project and replace all a#2 
with a#1 in this plan. However, this is not true for complex cases like 
https://github.com/apache/spark/pull/16255/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R2023
 .

Let's see if there is a way to fix this problem entirely.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-13 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92311295
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -416,8 +418,8 @@ object ColumnPruning extends Rule[LogicalPlan] {
 case w: Window if w.windowExpressions.isEmpty => w.child
 
 // Eliminate no-op Projects
-case p @ Project(_, child) if sameOutput(child.output, p.output) => 
child
-
+case p @ Project(_, child) if sameOutput(child.output, p.output) =>
+  if (child.isInstanceOf[CatalogRelation]) p else child
--- End diff --

yes, it will transform the right side of the join, and the problem happened 
when replace the right side of the join. Above step 5, it apply 
RemoveAliasOnlyProject rule on step 4's plan, which replace col#6 with col#9, 
but it does not have effect on `Project [col#9 AS col#6]` of the right side of 
the join. I think the right side `Project [col#9 AS col#6]` should be changed 
to `Project [col#9]`
```
Project [col#9 AS c1#4, col#10 AS c2#5]
+- Join Cross
:- Join Cross
: :- Project
: : +- MetastoreRelation default, p1
: +- MetastoreRelation default, p2
+- !Project [col#9 AS col#10]
+- Join Cross
:- Project
: +- MetastoreRelation default, p1
+- Project [col#9 AS col#6]
+- MetastoreRelation default, p2
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-13 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92293834
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -416,8 +418,8 @@ object ColumnPruning extends Rule[LogicalPlan] {
 case w: Window if w.windowExpressions.isEmpty => w.child
 
 // Eliminate no-op Projects
-case p @ Project(_, child) if sameOutput(child.output, p.output) => 
child
-
+case p @ Project(_, child) if sameOutput(child.output, p.output) =>
+  if (child.isInstanceOf[CatalogRelation]) p else child
--- End diff --

This works in your example case, but will fail for a more complex query.

The problem is that the transform on L199 also modifies parts of the tree, 
that are totally unrelated to the projection at hand. In this case right side 
of the join. This can currently only happen in case of a self-join on a CTE, 
such a tree can contain duplicate attribute ids because we expand the tree 
after we have analyzed the CTE.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-13 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92245640
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
--- End diff --

 Nevermind.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-13 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92197200
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
 case plan => plan transformExpressions {
   case a: Attribute if attrMap.contains(a) => attrMap(a)
+  case b: Alias if attrMap.exists(_._1.exprId == b.exprId)
--- End diff --

It looks like semanticEquals might be broken (this should be, but perhaps 
this is for a reason. @cloud-fan any idea?

Please use a proper pattern match here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-13 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r92186722
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -416,8 +418,8 @@ object ColumnPruning extends Rule[LogicalPlan] {
 case w: Window if w.windowExpressions.isEmpty => w.child
 
 // Eliminate no-op Projects
-case p @ Project(_, child) if sameOutput(child.output, p.output) => 
child
-
+case p @ Project(_, child) if sameOutput(child.output, p.output) =>
+  if (child.isInstanceOf[CatalogRelation]) p else child
--- End diff --

Why is a CatalogRelation different?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org