Furthermore, even adding an alias as suggested by the warning doesn't seem to help. Here is a slight modification of the example below:
scala> val largeValues = df.filter('value >= 10).as("lv")

And just looking at the join results:

scala> val j = smallValues
  .join(largeValues, smallValues("key") === largeValues("key"))

scala> j.select($"value").show
This throws an exception indicating that "value" is ambiguous (to be expected).

scala> j.select(smallValues("value")).show
This shows the left (smallValues) "value" column, as expected.

scala> j.select(largeValues("value")).show
This shows the left (smallValues) "value" column (resolved to the wrong column).

scala> j.select(largeValues("lv.value")).show
This shows the left (smallValues) "value" column (resolved to the wrong column, even though we explicitly specified the alias and used the right-hand DataFrame).

scala> j.select($"lv.value").show
This produces a cannot resolve 'lv.value' exception (so the lv alias is not preserved in the join result).

Does anyone know the appropriate way to use aliases in DataFrame operations, or is this a bug?

-- Ali

On Oct 20, 2015, at 5:23 PM, Isabelle Phan <nlip...@gmail.com> wrote:

> Hello,
>
> When joining 2 DataFrames which originate from the same initial DataFrame,
> why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method
> distinguish which column to read?
>
> Let me illustrate this question with a simple example (run on Spark 1.5.1):
>
> // my initial DataFrame
> scala> df
> res39: org.apache.spark.sql.DataFrame = [key: int, value: int]
>
> scala> df.show
> +---+-----+
> |key|value|
> +---+-----+
> |  1|    1|
> |  1|   10|
> |  2|    3|
> |  3|   20|
> |  3|    5|
> |  4|   10|
> +---+-----+
>
> // 2 children DataFrames
> scala> val smallValues = df.filter('value < 10)
> smallValues: org.apache.spark.sql.DataFrame = [key: int, value: int]
>
> scala> smallValues.show
> +---+-----+
> |key|value|
> +---+-----+
> |  1|    1|
> |  2|    3|
> |  3|    5|
> +---+-----+
>
> scala> val largeValues = df.filter('value >= 10)
> largeValues: org.apache.spark.sql.DataFrame = [key: int, value: int]
>
> scala> largeValues.show
> +---+-----+
> |key|value|
> +---+-----+
> |  1|   10|
> |  3|   20|
> |  4|   10|
> +---+-----+
>
> // Joining the children
> scala> smallValues
>   .join(largeValues, smallValues("key") === largeValues("key"))
>   .withColumn("diff", smallValues("value") - largeValues("value"))
>   .show
> 15/10/20 16:59:59 WARN Column: Constructing trivially true equals predicate,
> 'key#41 = key#41'. Perhaps you need to use aliases.
> +---+-----+---+-----+----+
> |key|value|key|value|diff|
> +---+-----+---+-----+----+
> |  1|    1|  1|   10|   0|
> |  3|    5|  3|   20|   0|
> +---+-----+---+-----+----+
>
> This last command issued a warning, but still executed the join correctly
> (rows with key 2 and 4 don't appear in the result set). However, the "diff"
> column is incorrect.
>
> Is this a bug, or am I missing something here?
>
> Thanks a lot for any input,
>
> Isabelle
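
[Editor's note] For reference, a sketch of two workarounds that are commonly used for this self-join ambiguity (assuming the Spark 1.5-era DataFrame API and the smallValues/largeValues frames from the thread; the renamed column names lv_key and lv_value are illustrative, not from the original messages):

```scala
// Workaround 1 (assumption: renaming before the join sidesteps the shared
// lineage entirely): give one side distinct column names, so every
// reference after the join is unambiguous.
val lv = largeValues
  .withColumnRenamed("key", "lv_key")     // lv_key / lv_value are
  .withColumnRenamed("value", "lv_value") // illustrative names

smallValues
  .join(lv, smallValues("key") === lv("lv_key"))
  .withColumn("diff", smallValues("value") - lv("lv_value"))
  .show()

// Workaround 2: alias both sides and qualify every column reference
// through the alias -- including the join condition itself -- rather
// than going through smallValues(...)/largeValues(...), whose column
// expressions still point at the same underlying attributes.
val joined = smallValues.as("sv")
  .join(largeValues.as("lv"), $"sv.key" === $"lv.key")

joined
  .withColumn("diff", $"sv.value" - $"lv.value")
  .show()
```

The key difference from the transcript above is that the aliases are applied to the DataFrames that actually enter the join, and the columns are then referenced only via the alias-qualified names ($"lv.value"), never via the original DataFrame handles.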