[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14899 I am closing this PR. I will propose a general solution in SPARK-17154. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14899 **[Test build #3240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3240/consoleFull)** for PR 14899 at commit [`d414a78`](https://github.com/apache/spark/commit/d414a78bc9489860ff755e0949dfda9f23bd6f6e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14899 Do you think I should keep the title of the PR as generic as it is? Feel free to suggest a specific title if you still think it will more appropriately describe the problem we are trying to fix here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14899 @hvanhovell, thanks for your feedback. I thought about narrowing the scope of my PR to just the subquery alias context. That would solve the problem I used in this PR. At first, I hesitated to fix it at this fundamental level. I then stepped back and thought about a bigger picture, from a semantics representation point of view. If a SQL statement (or an equivalence expressed in DataFrame/Dataset APIs) references a relation in more than once (says, two), we need to have a way to identify the two references of the same relation. This is done in Spark by having a unique (long int) identifier associated to a column/an expression of a relation. Correct me if I'm wrong, the implementation is in the `exprId` value of the `case class Alias`, which is implemented by calling the definition `newExprId`, which increments (by 1) for each time it is called. With this implementation, theoretically speaking, for any column/expression, at any level of a logical plan, we should be able to identify which stream the column is from (think of a relation is a vertex and a reference to a relation is an edge, we can have two edges come from the same vertex; then at the end of the two edges are the same column, say C1, but one comes from one stream (or edge) and the other comes from the second stream). This is the thinking behind my PR. So limiting the scope to just the subquery alias may leave holes for future problems when more rewrite and optimization rules are getting fancier. We just can't guarantee two different parts of a SQL statement that look unable to merge into the same subquery block at the Analysis phase may merge by new rules and the problem of the name collision will surface. In fact, I am keen to remove the `dedupRight` code completely as I view it as a hacky way of resolving name collision. I can't as one of the test cases fails because of that. I think it is a test case where it expresses a (self-)join on the same relation using DataFrame APIs written in Scala. I never be able to produce it using SQL directly. I think it is because in SQL, it will always construct the `Project` operator on top of a relation and hence my PR shields it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14899 **[Test build #3240 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3240/consoleFull)** for PR 14899 at commit [`d414a78`](https://github.com/apache/spark/commit/d414a78bc9489860ff755e0949dfda9f23bd6f6e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14899 @nsyca can you also make the title a bit more descriptive? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14899 @nsyca this is a good catch. Technically the analyzer creates a correctly analyzed plan. It is the optimizer that breaks it. However I do think we need to fix this in the analyzer. Your current approach is to alias every project in a subquery alias, that happens in a lot of places. We can be a bit more subtle here, by focussing on the subquery expression. We could just generalize the code in the `rewriteSubQuery` to always add a project. Could you take a look at this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14899 With the fix in this PR, the Analyzed Logical Plan and Optimized Logical Plan become: ``` == Analyzed Logical Plan == c3: int Project [c3#123] +- Filter NOT predicate-subquery#124 [(c3#123 = c2#133)] : +- Project [c2#133] : +- SubqueryAlias t2 :+- Project [value#107 AS c2#133] : +- LocalRelation [value#107] +- SubqueryAlias t3 +- Project [(c2#132 + 1) AS c3#123] +- Join LeftOuter, (c1#131 = c2#132) :- SubqueryAlias t1 : +- Project [value#100 AS c1#131] : +- LocalRelation [value#100] +- SubqueryAlias t2 +- Project [value#107 AS c2#132] +- LocalRelation [value#107] == Optimized Logical Plan == Project [(c2#132 + 1) AS c3#123] +- Join LeftAnti, (isnull(((c2#132 + 1) = c2#133)) || ((c2#132 + 1) = c2#133)) :- Join LeftOuter, (c1#131 = c2#132) : :- LocalRelation [c1#131] : +- LocalRelation [c2#132] +- LocalRelation [c2#133] ``` Here, the code generates a new `c2#133` for the T2 referenced in the NOT IN predicate in the Analyzed Logical Plan. With now the two different `c2`, the `PushPredicateThroughJoin` rule recognizes that the equality predicate `(c2#132 + 1) = c2#133 [nullAware]` is not a so-called local predicate (predicate referencing columns from a single relation) anymore and hence the predicate (or, more precise, the LeftAnti operator) is not pushed down to the right table `t2` of the LeftOuter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/14899 Some background on my investigation of the problem: How can we do name resolution in an SQL statement? ``` scala> sql("select * from t1, t1 t2").explain(true) == Parsed Logical Plan == 'Project [*] +- 'Join Inner :- 'UnresolvedRelation `t1` +- 'UnresolvedRelation `t1`, t2 == Analyzed Logical Plan == c1: int, c1: int Project [c1#3, c1#147] +- Join Inner :- SubqueryAlias t1 : +- Project [value#1 AS c1#3] : +- LocalRelation [value#1] +- SubqueryAlias t2 +- SubqueryAlias t1 +- Project [value#1 AS c1#147] +- LocalRelation [value#1] ``` Here the second reference to the column c1 is changed to c1#147 by the private method `dedupRight` in object `ResolveReferences` in Analyzer.scala. The problem is `dedupRight` is only called in 3 contexts, Join, Intersect, and Except. ``` // To resolve duplicate expression IDs for Join and Intersect case j @ Join(left, right, _, _) if !j.duplicateResolved => j.copy(right = dedupRight(left, right)) case i @ Intersect(left, right) if !i.duplicateResolved => i.copy(right = dedupRight(left, right)) case i @ Except(left, right) if !i.duplicateResolved => i.copy(right = dedupRight(left, right)) ``` The assumption of this name resolution algorithm is any name conflicts outside of the 3 aforementioned contexts are okay. Those name conflicts will never mix with one another to create any ambiguity. This is certainly not the case in the example below. ``` Seq(1,2).toDF("c1").createOrReplaceTempView("t1") Seq(1).toDF("c2").createOrReplaceTempView("t2") scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").show ++ | c3| ++ | 2| |null| ++ ``` The correct answer is 1 row of (2). From the plan below, the incorrect portion of the plan is the LeftAnti, rewritten from the NOT IN subquery, is pushed down below the (T1 LOJ T2) operation. Because LeftAnti predicate is evaluated to unknown (or null) if any argument of a comparison operator in the predicate is null, e.g. NULL = is evaluated to unknown (which is equivalent to false in the context of a predicate), the LeftAnti predicate cannot be pushed down into a LOJ operation. The logical plan looks like: ``` scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").explain(true) == Parsed Logical Plan == 'Project [*] +- 'Filter NOT 'c3 IN (list#124) : +- 'SubqueryAlias list#124 : +- 'Project ['c2] :+- 'UnresolvedRelation `t2` +- 'SubqueryAlias t3 +- 'Project [('t2.c2 + 1) AS c3#123] +- 'Join LeftOuter, ('t1.c1 = 't2.c2) :- 'UnresolvedRelation `t1` +- 'UnresolvedRelation `t2` == Analyzed Logical Plan == c3: int Project [c3#123] +- Filter NOT predicate-subquery#124 [(c3#123 = c2#77)] : +- SubqueryAlias predicate-subquery#124 [(c3#123 = c2#77)] : +- Project [c2#77] :+- SubqueryAlias t2 : +- Project [value#75 AS c2#77] : +- LocalRelation [value#75] +- SubqueryAlias t3 +- Project [(c2#77 + 1) AS c3#123] +- Join LeftOuter, (c1#3 = c2#77) :- SubqueryAlias t1 : +- Project [value#1 AS c1#3] : +- LocalRelation [value#1] +- SubqueryAlias t2 +- Project [value#75 AS c2#77] +- LocalRelation [value#75] ... ``` The culprit is the column `c2#77` from the two references of `SubqueryAlias t2` are from two different contexts. They cannot share the same name. The problem is subtle at this point from the derived column `c3#123` of the expression `(c2#77 + 1)` but it becomes obvious when looking at the Optimized Logical Plan (below): ``` == Optimized Logical Plan == Project [(c2#77 + 1) AS c3#123] +- Join LeftOuter, (c1#3 = c2#77) :- Project [value#1 AS c1#3] : +- Join LeftAnti, (isnull(((c2#77 + 1) = c2#77)) || ((c2#77 + 1) = c2#77)) : :- LocalRelation [value#1] : +- LocalRelation [c2#77] +- LocalRelation [c2#77] ``` Given more and more rewrite rules can be/will be added in the future, it is impossible to ensure that any name conflicts in any part of the same SQL statement will not collide as the logical plan gets rewritten. I think a solid solution is to generate a unique name for any column from any base relation. --- If your project
[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14899 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org