[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-19 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589692#comment-15589692
 ] 

Ryan Blue commented on SPARK-17995:
---

I'm not sure how that would work.

Here's an example. Say I have two tables:
{code}
scala> l.show
+---+----+
| id|left|
+---+----+
|  1|   a|
|  2|   b|
+---+----+

scala> r.show
+---+-----+
| id|right|
+---+-----+
|  1|    x|
+---+-----+
{code}

When I join those two on {{id}} and add an extra derived column, I end up with 
this *parsed plan*:
{code}
== Parsed Logical Plan ==
Project [id#5 AS id#165, left#6, right#16, isnull(right#16) AS (right IS NULL)#166]
+- Join LeftOuter, (id#5 = id#15)
   :- Project [_1#2 AS id#5, _2#3 AS left#6]
   :  +- LocalRelation [_1#2, _2#3]
   +- Project [_1#12 AS id#15, _2#13 AS right#16]
      +- LocalRelation [_1#12, _2#13]
{code}

The outermost project has a reference to {{right#16}}, which comes from 
{{select(r("right"), ...)}}. That reference, and any reference like it 
anywhere in the plan tree above the outer join, needs to be replaced with a new 
attribute that correctly shows the {{right}} column as nullable, which wasn't 
inferred when I created the dataframe. It's fairly easy to replace the join 
itself, but I don't see an easy way to replace all references to an attribute 
above the join.
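
To make the "replace all references" step concrete, here is a minimal sketch on a toy expression tree. The classes {{Attr}}, {{IsNull}}, and {{Alias}} are illustrative stand-ins, not Spark's Catalyst classes. The nullability flags and the new exprId 167 are assumptions made up for the example. It rewrites the project list from the parsed plan above so that every occurrence of {{right#16}}, including the one nested inside {{isnull(...)}}, points at a fresh nullable attribute.

```scala
// Toy expression tree: illustrative stand-ins, not Spark's Catalyst classes.
sealed trait Expr
final case class Attr(name: String, exprId: Int, nullable: Boolean) extends Expr
final case class IsNull(child: Expr) extends Expr
final case class Alias(child: Expr, name: String, exprId: Int) extends Expr

object ReplaceRefs {
  // Recursively swap any attribute whose exprId has a replacement.
  def replace(e: Expr, replacements: Map[Int, Attr]): Expr = e match {
    case a: Attr                => replacements.getOrElse(a.exprId, a)
    case IsNull(child)          => IsNull(replace(child, replacements))
    case Alias(child, name, id) => Alias(replace(child, replacements), name, id)
  }

  // Project list of the parsed plan, using the exprIds shown in that plan.
  // The nullable flags are assumed for the sake of the example.
  val projectList: Seq[Expr] = Seq(
    Alias(Attr("id", 5, nullable = false), "id", 165),
    Attr("left", 6, nullable = false),
    Attr("right", 16, nullable = false),
    Alias(IsNull(Attr("right", 16, nullable = false)), "(right IS NULL)", 166))

  // Hypothetical fresh attribute for `right` above the join; id 167 is made up.
  val replacements: Map[Int, Attr] =
    Map(16 -> Attr("right", 167, nullable = true))

  val rewritten: Seq[Expr] = projectList.map(replace(_, replacements))
}
```

Rewriting a single project list like this is the easy part. The difficulty described above is that the same substitution has to reach every operator above the join, whatever its type.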

> Use new attributes for columns from outer joins
> ---
>
> Key: SPARK-17995
> URL: https://issues.apache.org/jira/browse/SPARK-17995
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2, 2.0.0, 2.1.0
> Reporter: Ryan Blue
>
> Plans involving outer joins use the same attribute reference (by exprId) to 
> reference columns above the join and below the join. This is a false 
> equivalence that leads to bugs like SPARK-16181, in which attributes were 
> incorrectly replaced by the optimizer. The column has a different schema 
> above the outer join because its values may be null. The fix for that issue, 
> [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment 
> from [~cloud_fan] to fix this by using different attributes instead of 
> special-casing outer joins in rules. This issue tracks that improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587348#comment-15587348
 ] 

Wenchen Fan commented on SPARK-17995:
-

Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter to the `Join` 
class? For example:
{code}
case class Join(..., newOuterJoinAttrs: Seq[Attribute])

object Join {
  def apply(...) = {
    // Fresh attributes (new exprIds) for the side an outer join can null out.
    val newOuterJoinAttrs = joinType match {
      case LeftOuter => right.output.map(_.newInstance)
    }
  }
}
{code}
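
As a rough, self-contained illustration of that idea, the sketch below computes the new attributes for each join type. Everything in it is a toy: {{Attr}}, the {{JoinType}} objects, and {{newNullableInstance}} are hypothetical stand-ins rather than Catalyst classes. The handling of join types other than left outer is an assumption that goes beyond the snippet above, as is marking the new attributes nullable.

```scala
// Toy attribute; a stand-in for a Catalyst attribute, not Spark's class.
final case class Attr(name: String, exprId: Int, nullable: Boolean)

// Toy join types, named after the join type printed in the plan ("LeftOuter").
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

object OuterJoinAttrs {
  private var lastId = 1000 // arbitrary starting point for fresh ids

  // Hypothetical helper: same name, new exprId, nullable above the join.
  private def newNullableInstance(a: Attr): Attr = {
    lastId += 1
    a.copy(exprId = lastId, nullable = true)
  }

  // New attributes for whichever side(s) the join may fill with nulls.
  def newOuterJoinAttrs(joinType: JoinType, left: Seq[Attr], right: Seq[Attr]): Seq[Attr] =
    joinType match {
      case LeftOuter  => right.map(newNullableInstance)
      case RightOuter => left.map(newNullableInstance)
      case FullOuter  => (left ++ right).map(newNullableInstance)
      case Inner      => Nil
    }
}
```

This only covers creating the attributes at the join. Operators above the join would still have to be rewritten to reference them.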




[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-18 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586972#comment-15586972
 ] 

Ryan Blue commented on SPARK-17995:
---

[~cloud_fan] and [~yhuai], I'd like to help fix this, but I'm not sure of the 
best way.

I started to write an analyzer rule that uses transformUp on the initial 
logical plan, before unresolved aliases are resolved. That rule would find 
outer joins and build a map from each attribute to replace to the new attribute 
to use above the join, with nullability set to true and a new exprId. 
The attributes to replace come from the output of the outer join.

Where I ran into trouble was in replacing the attributes in the logical plan 
above the join. I don't think it is a good idea to have cases in the rule for 
every possible plan node, so I think we need a method to substitute attributes 
that is implemented by the nodes in the plan. That sounds like a larger patch 
than I originally thought, so I wanted to make sure I'm going down the right 
path before I put up a PR for it. What do you think?
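
A minimal sketch of that shape, on a toy plan model, might look like the following. None of these classes are Spark's: {{Attr}}, {{Plan}}, {{Relation}}, {{Project}}, {{LeftOuterJoin}}, and the {{substitute}} method are hypothetical, and the project list is simplified to plain attributes. The rule walks the plan bottom-up, creates new nullable attributes at each left outer join, and relies on each node's own {{substitute}} to rewrite its references, so the rule has no case per operator.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Toy stand-ins for attribute and plan classes (not Spark's real API).
final case class Attr(name: String, exprId: Int, nullable: Boolean)

sealed trait Plan {
  def output: Seq[Attr]
  def children: Seq[Plan]
  def withChildren(newChildren: Seq[Plan]): Plan
  // Each node rewrites only the attribute references it owns.
  def substitute(replacements: Map[Int, Attr]): Plan = this
}

final case class Relation(output: Seq[Attr]) extends Plan {
  def children: Seq[Plan] = Nil
  def withChildren(newChildren: Seq[Plan]): Plan = this
}

final case class Project(projectList: Seq[Attr], child: Plan) extends Plan {
  def output: Seq[Attr] = projectList
  def children: Seq[Plan] = Seq(child)
  def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  override def substitute(replacements: Map[Int, Attr]): Plan =
    copy(projectList = projectList.map(a => replacements.getOrElse(a.exprId, a)))
}

// `rightOutput` is what the join exposes for its right side. Before the
// rewrite it is just `right.output`: the same exprIds above and below.
final case class LeftOuterJoin(left: Plan, right: Plan, rightOutput: Seq[Attr]) extends Plan {
  def output: Seq[Attr] = left.output ++ rightOutput
  def children: Seq[Plan] = Seq(left, right)
  def withChildren(newChildren: Seq[Plan]): Plan =
    copy(left = newChildren(0), right = newChildren(1))
}

object ReplaceOuterJoinAttrs {
  private val nextId = new AtomicInteger(1000) // arbitrary start for fresh ids

  private def freshNullable(a: Attr): Attr =
    a.copy(exprId = nextId.incrementAndGet(), nullable = true)

  // Returns the rewritten plan and a map from old exprId to replacement.
  def rewrite(plan: Plan): (Plan, Map[Int, Attr]) = {
    val (newChildren, childMaps) = plan.children.map(rewrite).unzip
    val below = childMaps.foldLeft(Map.empty[Int, Attr])(_ ++ _)
    plan.withChildren(newChildren).substitute(below) match {
      case j: LeftOuterJoin =>
        val fresh = j.right.output.map(freshNullable)
        val here = j.right.output.map(_.exprId).zip(fresh).toMap
        // Chain earlier replacements through this join's replacements.
        val composed = below.map { case (k, v) => k -> here.getOrElse(v.exprId, v) } ++ here
        (j.copy(rightOutput = fresh), composed)
      case other => (other, below)
    }
  }
}
```

The trade-off is the one raised above: the rule stays small, but every plan node has to implement the substitution method, which is what makes this a larger patch.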
