Morten Hornbech created SPARK-23614:
---------------------------------------

             Summary: Union produces incorrect results when caching is used
                 Key: SPARK-23614
                 URL: https://issues.apache.org/jira/browse/SPARK-23614
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: Morten Hornbech


We just upgraded from 2.2 to 2.3 and our test suite caught this error:

{code:java}
import org.apache.spark.sql.functions.{col, min}
// Assumes session is a SparkSession with session.implicits._ imported, and
// TestData is a case class with Int fields x, y, z (its definition is not shown here).
val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
group1.union(group2).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    2|
// |  4|    5|
// |  1|    2|
// |  4|    5|
// +---+-----+
group2.union(group1).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    3|
// |  4|    6|
// |  1|    3|
// |  4|    6|
// +---+-----+
{code}

In both outputs the second half of the union repeats the first half instead of 
showing the other aggregate. The error disappears if the first data frame is 
not cached or if the two group-bys use separate copies. I'm not sure exactly 
what happens inside Spark, but errors that produce incorrect results rather 
than exceptions always concern me.
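
For reference, here is a minimal sketch of how the case could be written as a self-checking program. The object name UnionCacheCheck, the helper sortedRows, the local master setting and the shape of TestData (three Int fields x, y, z, inferred from the column names above) are assumptions made for illustration, not part of our actual test suite. The expected rows follow directly from the input data: min(y) per x gives (1, 2) and (4, 5), and min(z) per x gives (1, 3) and (4, 6).

{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, min}

// Assumed shape of TestData, inferred from the columns x, y and z used above.
case class TestData(x: Int, y: Int, z: Int)

object UnionCacheCheck {
  // Rows a correct union of min(y) per x and min(z) per x must contain.
  val expected: Seq[(Int, Int)] = Seq((1, 2), (1, 3), (4, 5), (4, 6))

  // Collect a two-column Int frame into a sorted sequence so that the
  // comparison ignores row order but still notices duplicated rows.
  def sortedRows(df: DataFrame): Seq[(Int, Int)] =
    df.collect().map(r => (r.getInt(0), r.getInt(1))).toSeq.sorted

  // Build group1.union(group2) from the given frame, as in the report.
  def unionOfMins(frame: DataFrame): DataFrame = {
    val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
    val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
    group1.union(group2)
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[*]") // assumption: local run, no cluster needed
      .appName("SPARK-23614-check")
      .getOrCreate()
    import session.implicits._

    val rows = Seq(TestData(1, 2, 3), TestData(4, 5, 6))

    // Without cache(): the variant reported above as giving correct results.
    val uncached = session.createDataset(rows).toDF()
    println(s"uncached matches expected: ${sortedRows(unionOfMins(uncached)) == expected}")

    // With cache(): the variant reported above as giving incorrect results on 2.3.0.
    val cached = session.createDataset(rows).toDF().cache()
    println(s"cached matches expected:   ${sortedRows(unionOfMins(cached)) == expected}")

    session.stop()
  }
}
{code}

The comparison uses a sorted sequence rather than a set so that a result with duplicated rows, like the ones shown above, cannot pass by accident. The check prints the outcome instead of asserting, so it runs to completion on an affected version as well.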



