Thanks Michael and Ali for the reply!

I'll make sure to use unresolved columns when working with self joins then.

As pointed by Ali, isn't there still an issue with the aliasing? It works
when using org.apache.spark.sql.functions.col(colName: String) method, but
not when using org.apache.spark.sql.DataFrame.apply(colName: String):

scala> j.select(col("lv.value")).show
+-----+
|value|
+-----+
|   10|
|   20|
+-----+


scala> j.select(largeValues("lv.value")).show
+-----+
|value|
+-----+
|    1|
|    5|
+-----+

Or does this behavior have the same root cause as detailed in Michael's
email?


-Isabelle




On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Unfortunately, the mechanisms that we use to differentiate columns
> automatically don't work particularly well in the presence of self joins.
> However, you can get it work if you use the $"column" syntax consistently:
>
> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", 
> "value")val smallValues = df.filter('value < 10).as("sv")val largeValues = 
> df.filter('value >= 10).as("lv")
> ​
> smallValues
>   .join(largeValues, $"sv.key" === $"lv.key")
>   .select($"sv.key".as("key"), $"sv.value".as("small_value"), 
> $"lv.value".as("large_value"))
>   .withColumn("diff", $"small_value" - $"large_value")
>   .show()
> +---+-----------+-----------+----+|key|small_value|large_value|diff|+---+-----------+-----------+----+|
>   1|          1|         10|  -9||  3|          5|         20| 
> -15|+---+-----------+-----------+----+
>
>
> The problem with the other cases is that calling smallValues("columnName")
> or largeValues("columnName") is eagerly resolving the attribute to the
> same column (since the data is actually coming from the same place).  By
> the time we realize that you are joining the data with itself (at which
> point we rewrite one side of the join to use different expression ids) its
> too late.  At the core the problem is that in Scala we have no easy way to
> differentiate largeValues("columnName") from smallValues("columnName").
> This is because the data is coming from the same DataFrame and we don't
> actually know which variable name you are using.  There are things we can
> change here, but its pretty hard to change the semantics without breaking
> other use cases.
>
> So, this isn't a straight forward "bug", but its definitely a usability
> issue.  For now, my advice would be: only use unresolved columns (i.e.
> $"[alias.]column" or col("[alias.]column")) when working with self joins.
>
> Michael
>

Reply via email to