Hello,

When joining two DataFrames that originate from the same initial DataFrame, why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read?
Let me illustrate this question with a simple example (run on Spark 1.5.1):

//my initial DataFrame
scala> df
res39: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> df.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|   10|
|  2|    3|
|  3|   20|
|  3|    5|
|  4|   10|
+---+-----+

//2 children DataFrames
scala> val smallValues = df.filter('value < 10)
smallValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> smallValues.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  2|    3|
|  3|    5|
+---+-----+

scala> val largeValues = df.filter('value >= 10)
largeValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> largeValues.show
+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  3|   20|
|  4|   10|
+---+-----+

//Joining the children
scala> smallValues
  .join(largeValues, smallValues("key") === largeValues("key"))
  .withColumn("diff", smallValues("value") - largeValues("value"))
  .show
15/10/20 16:59:59 WARN Column: Constructing trivially true equals predicate, 'key#41 = key#41'. Perhaps you need to use aliases.
+---+-----+---+-----+----+
|key|value|key|value|diff|
+---+-----+---+-----+----+
|  1|    1|  1|   10|   0|
|  3|    5|  3|   20|   0|
+---+-----+---+-----+----+

This last command issued a warning, but still executed the join correctly (rows with key 2 and 4 don't appear in the result set). However, the "diff" column is incorrect: it shows 0 on both rows, where I would expect -9 and -15. It looks as if smallValues("value") and largeValues("value") both resolve to the same column.

Is this a bug or am I missing something here?

Thanks a lot for any input,
Isabelle
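P.S. The warning mentions aliases. Below is a minimal sketch of how I read that hint, meant to be pasted into the spark-shell, so it assumes the predefined `sqlContext`. The alias names "small" and "large" and the variable names are my own choice, and the sample data is rebuilt with toDF. I am not claiming this is the intended fix, only that it refers to each side through its alias rather than through the child DataFrame.

```scala
// Sketch for the spark-shell on Spark 1.5.x, where `sqlContext` is predefined.
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

// Rebuild the sample data from the example above.
val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")

// Give each child DataFrame its own alias (the alias names are hypothetical).
val small = df.filter(col("value") < 10).as("small")
val large = df.filter(col("value") >= 10).as("large")

// Refer to columns through the aliases instead of small("key") / large("key").
val joinedOnKey = small.join(large, col("small.key") === col("large.key"))
val joined = joinedOnKey.withColumn("diff", col("small.value") - col("large.value"))

joined.show()
```

With the columns qualified this way, the join condition no longer compares a column with itself, so I would not expect the "trivially true equals predicate" warning here.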