Ok, got it. Thanks a lot Michael for the detailed reply! On Oct 21, 2015 1:54 PM, "Michael Armbrust" <mich...@databricks.com> wrote:
> Yeah, I was suggesting that you avoid using > org.apache.spark.sql.DataFrame.apply(colName: > String) when you are working with selfjoins as it eagerly binds to a > specific column in a what that breaks when we do the rewrite of one side of > the query. Using the apply method constructs a resolved column eagerly > (which looses the alias information). > > On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <nlip...@gmail.com> wrote: > >> Thanks Michael and Ali for the reply! >> >> I'll make sure to use unresolved columns when working with self joins >> then. >> >> As pointed by Ali, isn't there still an issue with the aliasing? It works >> when using org.apache.spark.sql.functions.col(colName: String) method, but >> not when using org.apache.spark.sql.DataFrame.apply(colName: String): >> >> scala> j.select(col("lv.value")).show >> +-----+ >> |value| >> +-----+ >> | 10| >> | 20| >> +-----+ >> >> >> scala> j.select(largeValues("lv.value")).show >> +-----+ >> |value| >> +-----+ >> | 1| >> | 5| >> +-----+ >> >> Or does this behavior have the same root cause as detailed in Michael's >> email? >> >> >> -Isabelle >> >> >> >> >> On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust < >> mich...@databricks.com> wrote: >> >>> Unfortunately, the mechanisms that we use to differentiate columns >>> automatically don't work particularly well in the presence of self joins. >>> However, you can get it work if you use the $"column" syntax >>> consistently: >>> >>> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", >>> "value")val smallValues = df.filter('value < 10).as("sv")val largeValues = >>> df.filter('value >= 10).as("lv") >>> >>> smallValues >>> .join(largeValues, $"sv.key" === $"lv.key") >>> .select($"sv.key".as("key"), $"sv.value".as("small_value"), >>> $"lv.value".as("large_value")) >>> .withColumn("diff", $"small_value" - $"large_value") >>> .show() >>> +---+-----------+-----------+----+|key|small_value|large_value|diff|+---+-----------+-----------+----+| >>> 1| 1| 10| -9|| 3| 5| 20| >>> -15|+---+-----------+-----------+----+ >>> >>> >>> The problem with the other cases is that calling >>> smallValues("columnName") or largeValues("columnName") is eagerly >>> resolving the attribute to the same column (since the data is actually >>> coming from the same place). By the time we realize that you are joining >>> the data with itself (at which point we rewrite one side of the join to use >>> different expression ids) its too late. At the core the problem is that in >>> Scala we have no easy way to differentiate largeValues("columnName") >>> from smallValues("columnName"). This is because the data is coming >>> from the same DataFrame and we don't actually know which variable name you >>> are using. There are things we can change here, but its pretty hard to >>> change the semantics without breaking other use cases. >>> >>> So, this isn't a straight forward "bug", but its definitely a usability >>> issue. For now, my advice would be: only use unresolved columns (i.e. >>> $"[alias.]column" or col("[alias.]column")) when working with self >>> joins. >>> >>> Michael >>> >> >> >