Re: How to distinguish columns when joining DataFrames with shared parent?

Isabelle Phan Wed, 21 Oct 2015 16:33:50 -0700

Ok, got it.
Thanks a lot Michael for the detailed reply!
On Oct 21, 2015 1:54 PM, "Michael Armbrust" <mich...@databricks.com> wrote:


> Yeah, I was suggesting that you avoid using  
> org.apache.spark.sql.DataFrame.apply(colName:
> String) when you are working with selfjoins as it eagerly binds to a
> specific column in a what that breaks when we do the rewrite of one side of
> the query.  Using the apply method constructs a resolved column eagerly
> (which looses the alias information).
>
> On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Thanks Michael and Ali for the reply!
>>
>> I'll make sure to use unresolved columns when working with self joins
>> then.
>>
>> As pointed by Ali, isn't there still an issue with the aliasing? It works
>> when using org.apache.spark.sql.functions.col(colName: String) method, but
>> not when using org.apache.spark.sql.DataFrame.apply(colName: String):
>>
>> scala> j.select(col("lv.value")).show
>> +-----+
>> |value|
>> +-----+
>> |   10|
>> |   20|
>> +-----+
>>
>>
>> scala> j.select(largeValues("lv.value")).show
>> +-----+
>> |value|
>> +-----+
>> |    1|
>> |    5|
>> +-----+
>>
>> Or does this behavior have the same root cause as detailed in Michael's
>> email?
>>
>>
>> -Isabelle
>>
>>
>>
>>
>> On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Unfortunately, the mechanisms that we use to differentiate columns
>>> automatically don't work particularly well in the presence of self joins.
>>> However, you can get it work if you use the $"column" syntax
>>> consistently:
>>>
>>> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", 
>>> "value")val smallValues = df.filter('value < 10).as("sv")val largeValues = 
>>> df.filter('value >= 10).as("lv")
>>> 
>>> smallValues
>>>   .join(largeValues, $"sv.key" === $"lv.key")
>>>   .select($"sv.key".as("key"), $"sv.value".as("small_value"), 
>>> $"lv.value".as("large_value"))
>>>   .withColumn("diff", $"small_value" - $"large_value")
>>>   .show()
>>> +---+-----------+-----------+----+|key|small_value|large_value|diff|+---+-----------+-----------+----+|
>>>   1|          1|         10|  -9||  3|          5|         20| 
>>> -15|+---+-----------+-----------+----+
>>>
>>>
>>> The problem with the other cases is that calling
>>> smallValues("columnName") or largeValues("columnName") is eagerly
>>> resolving the attribute to the same column (since the data is actually
>>> coming from the same place).  By the time we realize that you are joining
>>> the data with itself (at which point we rewrite one side of the join to use
>>> different expression ids) its too late.  At the core the problem is that in
>>> Scala we have no easy way to differentiate largeValues("columnName")
>>> from smallValues("columnName").  This is because the data is coming
>>> from the same DataFrame and we don't actually know which variable name you
>>> are using.  There are things we can change here, but its pretty hard to
>>> change the semantics without breaking other use cases.
>>>
>>> So, this isn't a straight forward "bug", but its definitely a usability
>>> issue.  For now, my advice would be: only use unresolved columns (i.e.
>>> $"[alias.]column" or col("[alias.]column")) when working with self
>>> joins.
>>>
>>> Michael
>>>
>>
>>
>

Re: How to distinguish columns when joining DataFrames with shared parent?

Reply via email to