Thank you for your replies. More details here:
The program is executed in local mode (single node), with default environment parameters. The test code and the results are in this gist: https://gist.github.com/coderh/0147467f0b185462048c

Here are the first 10 lines of the data (3 fields per row, delimited by ";"):

3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91
3802055;11781159;26
3802076;11782793;102
3802086;17881551;102
3802087;19064728;99
3802105;12760994;99
...

There are 27 partitions (small files); the total size is about 100 MB.

We suspect this problem is caused by the bug SPARK-2043: https://issues.apache.org/jira/browse/SPARK-2043

Could someone give more details on this bug? The pull request says:

> The current implementation reads one key with the next hash code as it
> finishes reading the keys with the current hash code, which may cause it to
> miss some matches of the next key. This can cause operations like join to
> give the wrong result when reduce tasks spill to disk and there are hash
> collisions, as values won't be matched together. This PR fixes it by not
> reading in that next key, using a peeking iterator instead.

I don't understand why reading a key with the next hash code would cause some matches of that key to be missed. If someone could point me to some code to dig into, it would be highly appreciated. =)

Thank you.

Hao.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13797.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
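P.S. To make the quoted description concrete for myself, here is a toy Python sketch of merging spilled streams that are sorted by key hash. This is NOT Spark's actual ExternalAppendOnlyMap code — the function `merge`, the stand-in hash `h`, and the sample data are all invented for illustration. The `peeking=False` branch mimics my reading of the bug: when a stream finishes a hash group, one pair belonging to the next hash code has already been consumed, so it stays attached to the wrong group's buffer and its values are never matched with equal keys from the other streams.

```python
from collections import defaultdict

# Toy stand-in for the key hash: "a" falls in bucket 1, "c" in bucket 2.
H = {"a": 1, "c": 2}

def h(key):
    return H[key]

def merge(spills, peeking):
    """Merge spill streams sorted by h(key), grouping values per key,
    one hash code at a time (a toy model, not Spark code)."""
    its = [iter(s) for s in spills]
    heads = [next(it, None) for it in its]
    result = []                                  # (key, [values]) per emitted group
    while any(p is not None for p in heads):
        cur = min(h(p[0]) for p in heads if p is not None)
        group = defaultdict(list)
        for idx, it in enumerate(its):
            consumed = False
            # Correct behaviour: consume only pairs whose hash == cur,
            # leaving the next hash code's pair un-consumed ("peeked").
            while heads[idx] is not None and h(heads[idx][0]) == cur:
                k, v = heads[idx]
                group[k].append(v)
                heads[idx] = next(it, None)
                consumed = True
            if not peeking and consumed and heads[idx] is not None:
                # The bug: one pair of the NEXT hash code has already been
                # read in, and it ends up in THIS group's buffer, so it can
                # no longer match equal keys from the other streams.
                k, v = heads[idx]
                group[k].append(v)
                heads[idx] = next(it, None)
        result.extend(sorted(group.items()))
    return result

spills = [[("a", 1), ("c", 10)], [("c", 20)]]
print(merge(spills, peeking=True))   # [('a', [1]), ('c', [10, 20])]
print(merge(spills, peeking=False))  # [('a', [1]), ('c', [10]), ('c', [20])]
```

With the buggy read-ahead, key "c" is emitted twice with its values split across groups instead of once with [10, 20] — which would explain joins and groupBy giving wrong results only when spills and hash collisions occur. If I read the PR right, the real fix uses Scala's BufferedIterator, whose `head` lets the merge inspect the next pair without consuming it.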