[jira] [Resolved] (SPARK-10892) Join with Data Frame returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-10892. --- Fix Version/s: (was: 3.0.0) Resolution: Duplicate > Join with Data Frame returns wrong results > -- > > Key: SPARK-10892 > URL: https://issues.apache.org/jira/browse/SPARK-10892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0, 2.4.0 >Reporter: Ofer Mendelevitch >Assignee: Wenchen Fan >Priority: Critical > Labels: correctness > Attachments: data.json > > > I'm attaching a simplified reproducible example of the problem: > 1. Loading a JSON file from HDFS as a Data Frame > 2. Creating 3 data frames: PRCP, TMIN, TMAX > 3. Joining the data frames together. Each of those has a column "value" with > the same name, so renaming them after the join. > 4. The output seems incorrect; the first column has the correct values, but > the two other columns seem to have a copy of the values from the first column. > Here's the sample code: > {code} > import org.apache.spark.sql._ > val sqlc = new SQLContext(sc) > val weather = sqlc.read.format("json").load("data.json") > val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache() > val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache() > val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache() > prcp.filter("year=2012 and month=10").show() > tmin.filter("year=2012 and month=10").show() > tmax.filter("year=2012 and month=10").show() > val out = (prcp.join(tmin, "date_str").join(tmax, "date_str") > .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"), > prcp("value").alias("PRCP"), tmin("value").alias("TMIN"), > tmax("value").alias("TMAX")) ) > out.filter("year=2012 and month=10").show() > {code} > The output is: > {code} > ++---+--+-+---+-++ > |date_str|day|metric|month|station|value|year| > ++---+--+-+---+-++ > |20121001| 1| PRCP| 10|USW00023272|0|2012| > |20121002| 2| PRCP| 10|USW00023272|0|2012| > |20121003| 3| PRCP| 10|USW00023272|0|2012| > |20121004| 4| PRCP| 10|USW00023272|0|2012| > |20121005| 5| PRCP| 10|USW00023272|0|2012| > |20121006| 6| PRCP| 10|USW00023272|0|2012| > |20121007| 7| PRCP| 10|USW00023272|0|2012| > |20121008| 8| PRCP| 10|USW00023272|0|2012| > |20121009| 9| PRCP| 10|USW00023272|0|2012| > |20121010| 10| PRCP| 10|USW00023272|0|2012| > |20121011| 11| PRCP| 10|USW00023272|3|2012| > |20121012| 12| PRCP| 10|USW00023272|0|2012| > |20121013| 13| PRCP| 10|USW00023272|0|2012| > |20121014| 14| PRCP| 10|USW00023272|0|2012| > |20121015| 15| PRCP| 10|USW00023272|0|2012| > |20121016| 16| PRCP| 10|USW00023272|0|2012| > |20121017| 17| PRCP| 10|USW00023272|0|2012| > |20121018| 18| PRCP| 10|USW00023272|0|2012| > |20121019| 19| PRCP| 10|USW00023272|0|2012| > |20121020| 20| PRCP| 10|USW00023272|0|2012| > ++---+--+-+---+-+——+ > ++---+--+-+---+-++ > |date_str|day|metric|month|station|value|year| > ++---+--+-+---+-++ > |20121001| 1| TMIN| 10|USW00023272| 139|2012| > |20121002| 2| TMIN| 10|USW00023272| 178|2012| > |20121003| 3| TMIN| 10|USW00023272| 144|2012| > |20121004| 4| TMIN| 10|USW00023272| 144|2012| > |20121005| 5| TMIN| 10|USW00023272| 139|2012| > |20121006| 6| TMIN| 10|USW00023272| 128|2012| > |20121007| 7| TMIN| 10|USW00023272| 122|2012| > |20121008| 8| TMIN| 10|USW00023272| 122|2012| > |20121009| 9| TMIN| 10|USW00023272| 139|2012| > |20121010| 10| TMIN| 10|USW00023272| 128|2012| > |20121011| 11| TMIN| 10|USW00023272| 122|2012| > |20121012| 12| TMIN| 10|USW00023272| 117|2012| > |20121013| 13| TMIN| 10|USW00023272| 122|2012| > |20121014| 14| TMIN| 10|USW00023272| 128|2012| > |20121015| 15| TMIN| 10|USW00023272| 128|2012| > |20121016| 16| TMIN| 10|USW00023272| 156|2012| > |20121017| 17| TMIN| 10|USW00023272| 139|2012| > |20121018| 18| TMIN| 10|USW00023272| 161|2012| > |20121019| 19| TMIN| 10|USW00023272| 133|2012| > |20121020| 20| TMIN| 10|USW00023272| 122|2012| > ++---+--+-+---+-+——+ > ++---+--+-+---+-++ > |date_str|day|metric|month|station|value|year| > ++---+--+-+---+-++ > |20121001| 1| TMAX| 10|USW00023272| 322|2012| > |20121002| 2| TMAX| 10|USW00023272| 344|2012| > |20121003| 3| TMAX| 10|USW00023272| 222|2012| > |20121004| 4| TMAX| 10|USW00023272| 189|2012| > |20121005| 5| TMAX
[jira] [Resolved] (SPARK-10892) Join with Data Frame returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-10892. -- Fix Version/s: 3.0.0 Assignee: Wenchen Fan Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/25107] > Join with Data Frame returns wrong results > -- > > Key: SPARK-10892 > URL: https://issues.apache.org/jira/browse/SPARK-10892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0, 2.4.0 >Reporter: Ofer Mendelevitch >Assignee: Wenchen Fan >Priority: Critical > Labels: correctness > Fix For: 3.0.0 > > Attachments: data.json > > > I'm attaching a simplified reproducible example of the problem: > 1. Loading a JSON file from HDFS as a Data Frame > 2. Creating 3 data frames: PRCP, TMIN, TMAX > 3. Joining the data frames together. Each of those has a column "value" with > the same name, so renaming them after the join. > 4. The output seems incorrect; the first column has the correct values, but > the two other columns seem to have a copy of the values from the first column. > Here's the sample code: > {code} > import org.apache.spark.sql._ > val sqlc = new SQLContext(sc) > val weather = sqlc.read.format("json").load("data.json") > val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache() > val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache() > val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache() > prcp.filter("year=2012 and month=10").show() > tmin.filter("year=2012 and month=10").show() > tmax.filter("year=2012 and month=10").show() > val out = (prcp.join(tmin, "date_str").join(tmax, "date_str") > .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"), > prcp("value").alias("PRCP"), tmin("value").alias("TMIN"), > tmax("value").alias("TMAX")) ) > out.filter("year=2012 and month=10").show() > {code} > The output is: > {code} > ++---+--+-+---+-++ > |date_str|day|metric|month|station|value|year| > ++---+--+-+---+-++ > |20121001| 1| PRCP| 10|USW00023272|0|2012| > |20121002| 2| PRCP| 10|USW00023272|0|2012| > |20121003| 3| PRCP| 10|USW00023272|0|2012| > |20121004| 4| PRCP| 10|USW00023272|0|2012| > |20121005| 5| PRCP| 10|USW00023272|0|2012| > |20121006| 6| PRCP| 10|USW00023272|0|2012| > |20121007| 7| PRCP| 10|USW00023272|0|2012| > |20121008| 8| PRCP| 10|USW00023272|0|2012| > |20121009| 9| PRCP| 10|USW00023272|0|2012| > |20121010| 10| PRCP| 10|USW00023272|0|2012| > |20121011| 11| PRCP| 10|USW00023272|3|2012| > |20121012| 12| PRCP| 10|USW00023272|0|2012| > |20121013| 13| PRCP| 10|USW00023272|0|2012| > |20121014| 14| PRCP| 10|USW00023272|0|2012| > |20121015| 15| PRCP| 10|USW00023272|0|2012| > |20121016| 16| PRCP| 10|USW00023272|0|2012| > |20121017| 17| PRCP| 10|USW00023272|0|2012| > |20121018| 18| PRCP| 10|USW00023272|0|2012| > |20121019| 19| PRCP| 10|USW00023272|0|2012| > |20121020| 20| PRCP| 10|USW00023272|0|2012| > ++---+--+-+---+-+——+ > ++---+--+-+---+-++ > |date_str|day|metric|month|station|value|year| > ++---+--+-+---+-++ > |20121001| 1| TMIN| 10|USW00023272| 139|2012| > |20121002| 2| TMIN| 10|USW00023272| 178|2012| > |20121003| 3| TMIN| 10|USW00023272| 144|2012| > |20121004| 4| TMIN| 10|USW00023272| 144|2012| > |20121005| 5| TMIN| 10|USW00023272| 139|2012| > |20121006| 6| TMIN| 10|USW00023272| 128|2012| > |20121007| 7| TMIN| 10|USW00023272| 122|2012| > |20121008| 8| TMIN| 10|USW00023272| 122|2012| > |20121009| 9| TMIN| 10|USW00023272| 139|2012| > |20121010| 10| TMIN| 10|USW00023272| 128|2012| > |20121011| 11| TMIN| 10|USW00023272| 122|2012| > |20121012| 12| TMIN| 10|USW00023272| 117|2012| > |20121013| 13| TMIN| 10|USW00023272| 122|2012| > |20121014| 14| TMIN| 10|USW00023272| 128|2012| > |20121015| 15| TMIN| 10|USW00023272| 128|2012| > |20121016| 16| TMIN| 10|USW00023272| 156|2012| > |20121017| 17| TMIN| 10|USW00023272| 139|2012| > |20121018| 18| TMIN| 10|USW00023272| 161|2012| > |20121019| 19| TMIN| 10|USW00023272| 133|2012| > |20121020| 20| TMIN| 10|USW00023272| 122|2012| > ++---+--+-+---+-+——+ > ++---+--+-+---+-++ > |date_str|day|metric|month|station|value|year| > ++---+--+-+---+-++ > |20121001| 1| TMAX| 10|USW00023272| 322|2012| > |20121002| 2| TMAX| 10|USW00023272| 344|2012| > |20121003| 3|