Morten Hornbech created SPARK-23614:
---------------------------------------

             Summary: Union produces incorrect results when caching is used
                 Key: SPARK-23614
                 URL: https://issues.apache.org/jira/browse/SPARK-23614
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: Morten Hornbech


We just upgraded from 2.2 to 2.3 and our test suite caught this error:

{code:java}
val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame.groupBy("x").agg(min(col("z")) as "value")

group1.union(group2).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    2|
// |  4|    5|
// |  1|    2|
// |  4|    5|
// +---+-----+

group2.union(group1).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    3|
// |  4|    6|
// |  1|    3|
// |  4|    6|
// +---+-----+
{code}

In both cases the second half of the union repeats the first half instead of showing the other aggregate. The error disappears if the first data frame is not cached, or if the two group-bys use separate copies of it. I'm not sure exactly what happens inside Spark, but errors that produce incorrect results rather than exceptions always concern me.
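
For anyone who wants to check their own build, the reproduction above could be wrapped into an order-independent check along the following lines. This is only a sketch: the `TestData` definition is an assumption (the report shows its column names but not the class itself), and `UnionCacheCheck`, `unionRows`, `expected` and `useCache` are hypothetical names introduced for illustration. The expected rows are derived from the input data: min(y) per x gives (1, 2) and (4, 5), and min(z) per x gives (1, 3) and (4, 6).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min}

// Assumed definition: the report only shows the column names x, y and z.
case class TestData(x: Int, y: Int, z: Int)

object UnionCacheCheck {
  // Correct rows for the union, independent of row order.
  val expected: Set[(Int, Int)] = Set((1, 2), (4, 5), (1, 3), (4, 6))

  // Builds the union from the report, with or without cache(), and
  // returns its rows as a set of (x, value) pairs.
  def unionRows(session: SparkSession, useCache: Boolean): Set[(Int, Int)] = {
    import session.implicits._
    val base = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6)))
    val frame = if (useCache) base.cache() else base
    val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
    val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
    group1.union(group2).collect().map(r => (r.getInt(0), r.getInt(1))).toSet
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-23614-check")
      .getOrCreate()
    try {
      // Prints whether each variant matches the expected rows.
      println(s"without cache: ${unionRows(session, useCache = false) == expected}")
      println(s"with cache:    ${unionRows(session, useCache = true) == expected}")
    } finally {
      session.stop()
    }
  }
}
```

Comparing sets rather than the printed table avoids depending on row order, which a union of two aggregations does not guarantee. According to the report, the cached variant is the one that returns wrong rows on 2.3.0.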