Re: How to distinguish columns when joining DataFrames with shared parent?

Xiao Li Thu, 22 Oct 2015 06:56:07 -0700

Actually, I found a design issue in self joins. When we have multiple-layer
projections above alias, the information of alias relation between alias
and actual columns are lost. Thus, when resolving the alias in self joins,
the rules treat the alias (e.g., in Projection) as normal columns. This
only happens when using dataFrames. When using sql, the duplicate names
after self join will stop another self join.


We need a mechanism to trace back the original/actual column for each
alias, like what RDBMS optimizers are doing. The most efficient way is to
directly store the alias-information in the node to indicate if this is
from alias; otherwise, we need to traverse the underlying tree for each
column to confirm it is not from alias even if it is not from an alias

Good luck,

Xiao Li

2015-10-21 16:33 GMT-07:00 Isabelle Phan <nlip...@gmail.com>:

> Ok, got it.
> Thanks a lot Michael for the detailed reply!
> On Oct 21, 2015 1:54 PM, "Michael Armbrust" <mich...@databricks.com>
> wrote:
>
>> Yeah, I was suggesting that you avoid using  
>> org.apache.spark.sql.DataFrame.apply(colName:
>> String) when you are working with selfjoins as it eagerly binds to a
>> specific column in a what that breaks when we do the rewrite of one side of
>> the query.  Using the apply method constructs a resolved column eagerly
>> (which looses the alias information).
>>
>> On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>
>>> Thanks Michael and Ali for the reply!
>>>
>>> I'll make sure to use unresolved columns when working with self joins
>>> then.
>>>
>>> As pointed by Ali, isn't there still an issue with the aliasing? It
>>> works when using org.apache.spark.sql.functions.col(colName: String)
>>> method, but not when using org.apache.spark.sql.DataFrame.apply(colName:
>>> String):
>>>
>>> scala> j.select(col("lv.value")).show
>>> +-----+
>>> |value|
>>> +-----+
>>> |   10|
>>> |   20|
>>> +-----+
>>>
>>>
>>> scala> j.select(largeValues("lv.value")).show
>>> +-----+
>>> |value|
>>> +-----+
>>> |    1|
>>> |    5|
>>> +-----+
>>>
>>> Or does this behavior have the same root cause as detailed in Michael's
>>> email?
>>>
>>>
>>> -Isabelle
>>>
>>>
>>>
>>>
>>> On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Unfortunately, the mechanisms that we use to differentiate columns
>>>> automatically don't work particularly well in the presence of self joins.
>>>> However, you can get it work if you use the $"column" syntax
>>>> consistently:
>>>>
>>>> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 
>>>> 10)).toDF("key", "value")val smallValues = df.filter('value < 
>>>> 10).as("sv")val largeValues = df.filter('value >= 10).as("lv")
>>>> 
>>>> smallValues
>>>>   .join(largeValues, $"sv.key" === $"lv.key")
>>>>   .select($"sv.key".as("key"), $"sv.value".as("small_value"), 
>>>> $"lv.value".as("large_value"))
>>>>   .withColumn("diff", $"small_value" - $"large_value")
>>>>   .show()
>>>> +---+-----------+-----------+----+|key|small_value|large_value|diff|+---+-----------+-----------+----+|
>>>>   1|          1|         10|  -9||  3|          5|         20| 
>>>> -15|+---+-----------+-----------+----+
>>>>
>>>>
>>>> The problem with the other cases is that calling
>>>> smallValues("columnName") or largeValues("columnName") is eagerly
>>>> resolving the attribute to the same column (since the data is actually
>>>> coming from the same place).  By the time we realize that you are joining
>>>> the data with itself (at which point we rewrite one side of the join to use
>>>> different expression ids) its too late.  At the core the problem is that in
>>>> Scala we have no easy way to differentiate largeValues("columnName")
>>>> from smallValues("columnName").  This is because the data is coming
>>>> from the same DataFrame and we don't actually know which variable name you
>>>> are using.  There are things we can change here, but its pretty hard to
>>>> change the semantics without breaking other use cases.
>>>>
>>>> So, this isn't a straight forward "bug", but its definitely a usability
>>>> issue.  For now, my advice would be: only use unresolved columns (i.e.
>>>> $"[alias.]column" or col("[alias.]column")) when working with self
>>>> joins.
>>>>
>>>> Michael
>>>>
>>>
>>>
>>

Re: How to distinguish columns when joining DataFrames with shared parent?

Reply via email to