[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-09-27 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/14899
  
I am closing this PR. I will propose a general solution in SPARK-17154.





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14899
  
**[Test build #3240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3240/consoleFull)** for PR 14899 at commit [`d414a78`](https://github.com/apache/spark/commit/d414a78bc9489860ff755e0949dfda9f23bd6f6e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/14899
  
Do you think I should keep the title of the PR as generic as it is? Feel 
free to suggest a more specific title if you think it would describe the 
problem we are trying to fix here more appropriately.





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/14899
  
@hvanhovell, thanks for your feedback.

I thought about narrowing the scope of my PR to just the subquery alias 
context. That would solve the problem I used as an example in this PR. At first 
I hesitated to fix it at this fundamental level, but I then stepped back and 
thought about the bigger picture, from a semantic-representation point of view. 
If a SQL statement (or an equivalent expressed in the DataFrame/Dataset APIs) 
references a relation more than once (say, twice), we need a way to tell the 
two references to the same relation apart. Spark does this by associating a 
unique (long integer) identifier with each column/expression of a relation. 
Correct me if I'm wrong, but the implementation is the `exprId` value of the 
`case class Alias`, obtained by calling `newExprId`, a counter that increments 
by 1 each time it is called. With this implementation, we should in theory be 
able to identify, for any column/expression at any level of a logical plan, 
which stream the column comes from (think of a relation as a vertex and a 
reference to a relation as an edge: two edges can come from the same vertex, 
and at the end of both edges is the same column, say C1, but one instance comes 
from one stream, or edge, and the other from the second stream). This is the 
thinking behind my PR. Limiting the scope to just the subquery alias may leave 
holes for future problems as rewrite and optimization rules get fancier. We 
simply cannot guarantee that two parts of a SQL statement that look unmergeable 
into the same subquery block at the analysis phase will never be merged by new 
rules, at which point the name-collision problem will surface.

In fact, I am keen to remove the `dedupRight` code completely, as I view it 
as a hacky way of resolving name collisions. I can't, because one of the test 
cases fails without it. I believe it is a test case that expresses a (self-)join 
on the same relation using the DataFrame API in Scala; I have never been able 
to reproduce it using SQL directly. I think that is because SQL always 
constructs a `Project` operator on top of a relation, and hence my PR shields 
that case.
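
For illustration, here is a minimal spark-shell sketch of the kind of 
DataFrame-API self-join I mean (an assumption about what such a test case 
looks like, not the actual failing test):

```scala
// Hypothetical repro of the DataFrame-API self-join case described above.
// Both sides of the join reuse the same LocalRelation, so they initially
// carry identical exprIds and the analyzer's dedupRight has to re-alias
// one side to keep the attributes distinguishable.
val df = Seq(1, 2).toDF("c1")
df.as("l").join(df.as("r"), $"l.c1" === $"r.c1").explain(true)
```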







[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14899
  
**[Test build #3240 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3240/consoleFull)** for PR 14899 at commit [`d414a78`](https://github.com/apache/spark/commit/d414a78bc9489860ff755e0949dfda9f23bd6f6e).





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14899
  
@nsyca can you also make the title a bit more descriptive?





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14899
  
@nsyca this is a good catch.

Technically the analyzer creates a correctly analyzed plan. It is the 
optimizer that breaks it. However, I do think we need to fix this in the 
analyzer.

Your current approach is to alias every project in a subquery alias, which 
happens in a lot of places. We can be a bit more subtle here by focusing on 
the subquery expression: we could just generalize the code in 
`rewriteSubQuery` to always add a project. Could you take a look at this?
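
Something along these lines is what I have in mind (just a sketch to show the 
idea, not the actual `rewriteSubQuery` code): wrap the subquery plan in a 
`Project` whose aliases carry fresh `exprId`s, so its output attributes can 
never collide with attributes in the outer plan.

```scala
import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Hypothetical helper: re-alias every output attribute of the subquery plan.
// Alias(...)() allocates a fresh exprId, so the wrapped plan exposes
// brand-new attribute IDs to the enclosing query.
def aliasSubqueryOutput(subquery: LogicalPlan): LogicalPlan =
  Project(subquery.output.map(a => Alias(a, a.name)()), subquery)
```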







[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/14899
  
With the fix in this PR, the Analyzed Logical Plan and Optimized Logical 
Plan become:

```
== Analyzed Logical Plan ==
c3: int
Project [c3#123]
+- Filter NOT predicate-subquery#124 [(c3#123 = c2#133)]
   :  +- Project [c2#133]
   :     +- SubqueryAlias t2
   :        +- Project [value#107 AS c2#133]
   :           +- LocalRelation [value#107]
   +- SubqueryAlias t3
      +- Project [(c2#132 + 1) AS c3#123]
         +- Join LeftOuter, (c1#131 = c2#132)
            :- SubqueryAlias t1
            :  +- Project [value#100 AS c1#131]
            :     +- LocalRelation [value#100]
            +- SubqueryAlias t2
               +- Project [value#107 AS c2#132]
                  +- LocalRelation [value#107]

== Optimized Logical Plan ==
Project [(c2#132 + 1) AS c3#123]
+- Join LeftAnti, (isnull(((c2#132 + 1) = c2#133)) || ((c2#132 + 1) = c2#133))
   :- Join LeftOuter, (c1#131 = c2#132)
   :  :- LocalRelation [c1#131]
   :  +- LocalRelation [c2#132]
   +- LocalRelation [c2#133]
```

Here, the code generates a new `c2#133` for the `t2` referenced in the NOT IN 
predicate in the Analyzed Logical Plan. Now that there are two different `c2` 
columns, the `PushPredicateThroughJoin` rule recognizes that the equality 
predicate `(c2#132 + 1) = c2#133 [nullAware]` is no longer a so-called local 
predicate (a predicate referencing columns from a single relation), and hence 
the predicate (or, more precisely, the LeftAnti operator) is not pushed down to 
the right table `t2` of the LeftOuter join.
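
With the two references now carrying distinct IDs, the repro query from my 
earlier comment should return the single correct row (expected output shown as 
comments; not captured from a run):

```scala
// Same repro query as before. With the fix, the LeftAnti join (the rewritten
// NOT IN) stays on top of the left outer join instead of being pushed below it,
// so the filter is evaluated against the final c3 values.
// Expected result: a single row containing 2.
sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").show()
// +---+
// | c3|
// +---+
// |  2|
// +---+
```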





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/14899
  
Some background on my investigation of the problem:

How does Spark resolve names in an SQL statement?

```
scala> sql("select * from t1, t1 t2").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner
   :- 'UnresolvedRelation `t1`
   +- 'UnresolvedRelation `t1`, t2

== Analyzed Logical Plan ==
c1: int, c1: int
Project [c1#3, c1#147]
+- Join Inner
   :- SubqueryAlias t1
   :  +- Project [value#1 AS c1#3]
   :     +- LocalRelation [value#1]
   +- SubqueryAlias t2
      +- SubqueryAlias t1
         +- Project [value#1 AS c1#147]
            +- LocalRelation [value#1]
```

Here the second reference to the column `c1` is changed to `c1#147` by the 
private method `dedupRight` in object `ResolveReferences` in Analyzer.scala. 
The problem is that `dedupRight` is only called in three contexts: Join, 
Intersect, and Except.

```
  // To resolve duplicate expression IDs for Join and Intersect
  case j @ Join(left, right, _, _) if !j.duplicateResolved =>
    j.copy(right = dedupRight(left, right))
  case i @ Intersect(left, right) if !i.duplicateResolved =>
    i.copy(right = dedupRight(left, right))
  case i @ Except(left, right) if !i.duplicateResolved =>
    i.copy(right = dedupRight(left, right))
```

The assumption of this name resolution algorithm is that any name conflicts 
outside of the three aforementioned contexts are okay: those conflicting names 
will never mix with one another to create any ambiguity. This is certainly not 
the case in the example below.

```
Seq(1,2).toDF("c1").createOrReplaceTempView("t1")
Seq(1).toDF("c2").createOrReplaceTempView("t2")

scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on 
t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").show

++
|  c3|
++
|   2|
|null|
++
```

The correct answer is a single row of (2). From the plan below, the incorrect 
portion of the plan is that the LeftAnti join, rewritten from the NOT IN 
subquery, is pushed down below the (t1 LOJ t2) operation. Because the LeftAnti 
predicate evaluates to unknown (or null) if any argument of a comparison 
operator in the predicate is null, e.g. `NULL = <any value>` evaluates to 
unknown (which is equivalent to false in the context of a predicate), the 
LeftAnti predicate cannot be pushed down below a LOJ operation.
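
As a quick, standalone illustration of the three-valued logic involved (a tiny 
sketch, independent of the repro tables above):

```scala
// A NULL operand makes a comparison evaluate to unknown (null), and a
// WHERE / NOT IN filter treats unknown like false. Expected: the expression
// below shows null, so a row guarded by this predicate would be dropped.
sql("select cast(null as int) not in (1, 2) as q").show()
```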

The logical plan looks like:

```
scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on 
t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT 'c3 IN (list#124)
   :  +- 'SubqueryAlias list#124
   : +- 'Project ['c2]
   :+- 'UnresolvedRelation `t2`
   +- 'SubqueryAlias t3
  +- 'Project [('t2.c2 + 1) AS c3#123]
 +- 'Join LeftOuter, ('t1.c1 = 't2.c2)
:- 'UnresolvedRelation `t1`
+- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
c3: int
Project [c3#123]
+- Filter NOT predicate-subquery#124 [(c3#123 = c2#77)]
   :  +- SubqueryAlias predicate-subquery#124 [(c3#123 = c2#77)]
   : +- Project [c2#77]
   :+- SubqueryAlias t2
   :   +- Project [value#75 AS c2#77]
   :  +- LocalRelation [value#75]
   +- SubqueryAlias t3
  +- Project [(c2#77 + 1) AS c3#123]
 +- Join LeftOuter, (c1#3 = c2#77)
:- SubqueryAlias t1
:  +- Project [value#1 AS c1#3]
: +- LocalRelation [value#1]
+- SubqueryAlias t2
   +- Project [value#75 AS c2#77]
  +- LocalRelation [value#75]
...
```
The culprit is that the column `c2#77` in the two references of `SubqueryAlias 
t2` comes from two different contexts; the two references cannot share the same 
name. The problem is subtle at this point because it hides behind the derived 
column `c3#123` of the expression `(c2#77 + 1)`, but it becomes obvious when 
looking at the Optimized Logical Plan (below):

```
== Optimized Logical Plan ==
Project [(c2#77 + 1) AS c3#123]
+- Join LeftOuter, (c1#3 = c2#77)
   :- Project [value#1 AS c1#3]
   :  +- Join LeftAnti, (isnull(((c2#77 + 1) = c2#77)) || ((c2#77 + 1) = c2#77))
   :     :- LocalRelation [value#1]
   :     +- LocalRelation [c2#77]
   +- LocalRelation [c2#77]
```

Given that more and more rewrite rules can and will be added in the future, it 
is impossible to ensure that name conflicts in different parts of the same SQL 
statement will not collide as the logical plan gets rewritten. I think a solid 
solution is to generate a unique name for any column of any base relation 
reference.
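
The building block for that already exists. As a tiny illustration (assuming a 
spark-shell or test with access to the Catalyst internals; this snippet is not 
part of the PR):

```scala
import org.apache.spark.sql.catalyst.expressions.NamedExpression

// Each call hands out a fresh, monotonically increasing ID; a per-reference
// renaming scheme would lean on exactly this mechanism.
val id1 = NamedExpression.newExprId
val id2 = NamedExpression.newExprId
assert(id1 != id2)
```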





[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

2016-08-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14899
  
Can one of the admins verify this patch?

