Hello,

When joining two DataFrames that originate from the same initial DataFrame,
why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method
distinguish which column to read?

Let me illustrate this question with a simple example (run on Spark 1.5.1):

//my initial DataFrame
scala> df
res39: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> df.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|   10|
|  2|    3|
|  3|   20|
|  3|    5|
|  4|   10|
+---+-----+


//2 children DataFrames
scala> val smallValues = df.filter('value < 10)
smallValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> smallValues.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  2|    3|
|  3|    5|
+---+-----+


scala> val largeValues = df.filter('value >= 10)
largeValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> largeValues.show
+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  3|   20|
|  4|   10|
+---+-----+


//Joining the children
scala> smallValues
  .join(largeValues, smallValues("key") === largeValues("key"))
  .withColumn("diff", smallValues("value") - largeValues("value"))
  .show
15/10/20 16:59:59 WARN Column: Constructing trivially true equals
predicate, 'key#41 = key#41'. Perhaps you need to use aliases.
+---+-----+---+-----+----+
|key|value|key|value|diff|
+---+-----+---+-----+----+
|  1|    1|  1|   10|   0|
|  3|    5|  3|   20|   0|
+---+-----+---+-----+----+


This last command issued a warning, but still executed the join correctly
(rows with key 2 and 4 don't appear in the result set). However, the "diff"
column is incorrect: it is 0 in both rows, whereas I expected -9 (1 - 10)
and -15 (5 - 20).
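
The warning suggests using aliases. As far as I understand it, that would
mean something like the sketch below, written for the spark-shell (where sc
and the sqlContext implicits are already available). The alias names "small"
and "large" are my own choice, and I am assuming that qualified names such as
"small.value" resolve to the intended side of the join; I have not included
output here.

```scala
// Sketch for the spark-shell (Spark 1.5.x). Relies on the shell's `sc`
// and on the automatically imported sqlContext.implicits._ for toDF.
import org.apache.spark.sql.functions.col

// Same data as the initial DataFrame shown above.
val df = sc.parallelize(Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)))
  .toDF("key", "value")

// Alias each child DataFrame ("small" and "large" are illustrative names)
// so that columns can be referenced by qualified name.
val small = df.filter(col("value") < 10).as("small")
val large = df.filter(col("value") >= 10).as("large")

// Refer to columns through the aliases rather than through
// smallValues("...") / largeValues("...").
val joined = small
  .join(large, col("small.key") === col("large.key"))
  .withColumn("diff", col("small.value") - col("large.value"))

joined.show()
```

This still does not tell me why smallValues("value") and largeValues("value")
end up referring to the same column in the first place.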

Is this a bug or am I missing something here?


Thanks a lot for any input,

Isabelle
