Thank you for your replies.

More details here:

The program is executed in local mode (single node). Default environment
parameters are used.

The test code and the result are in this gist:
https://gist.github.com/coderh/0147467f0b185462048c

Here are the first 10 lines of the data: 3 fields per row, the delimiter is ";":

3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91
3802055;11781159;26
3802076;11782793;102
3802086;17881551;102
3802087;19064728;99
3802105;12760994;99
...

There are 27 partitions (small files). The total size is about 100 MB.
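
For reference, the test is roughly of the following shape (a simplified
sketch, not the exact code in the gist; the input path and names are
placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local").setAppName("groupBy-check"))

        // Each line has 3 ";"-separated fields; group on the first field.
        val rows = sc.textFile("data/part-*").map(_.split(";"))   // placeholder path, 27 small files

        // Run the same groupBy twice and compare the per-key group sizes.
        val sizes1 = rows.groupBy(_(0)).mapValues(_.size).collectAsMap()
        val sizes2 = rows.groupBy(_(0)).mapValues(_.size).collectAsMap()

        val differing = sizes1.keys.count(k => sizes1(k) != sizes2.getOrElse(k, 0))
        println("keys with differing group sizes: " + differing)

        sc.stop()
      }
    }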

We suspect that this problem is most likely caused by the bug SPARK-2043:
https://issues.apache.org/jira/browse/SPARK-2043

Could someone give more details on this bug?

The pull request says:

The current implementation reads one key with the next hash code as it
finishes reading the keys with the current hash code, which may cause it to
miss some matches of the next key. This can cause operations like join to
give the wrong result when reduce tasks spill to disk and there are hash
collisions, as values won't be matched together. This PR fixes it by not
reading in that next key, using a peeking iterator instead.
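
If I follow the description, the "peeking iterator" part means grouping
entries by the hash code of their key without ever consuming the first entry
of the next hash group. A toy sketch of that idea (not the actual Spark code)
would be:

    import scala.collection.mutable.ListBuffer

    // Group consecutive (key, value) entries that share a key hash code,
    // peeking at the head instead of consuming it, so the first entry of
    // the next hash group is never read in early.
    def groupsByHash[K, V](entries: Iterator[(K, V)]): Iterator[List[(K, V)]] = {
      val buf = entries.buffered        // BufferedIterator: head peeks without consuming
      new Iterator[List[(K, V)]] {
        def hasNext: Boolean = buf.hasNext
        def next(): List[(K, V)] = {
          val hash = buf.head._1.hashCode
          val group = ListBuffer[(K, V)]()
          while (buf.hasNext && buf.head._1.hashCode == hash) {
            group += buf.next()
          }
          group.toList
        }
      }
    }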

What I don't understand is why reading a key with the next hash code would
cause it to miss some matches of that next key. If someone could show me some
code to dig into, it would be highly appreciated. =)
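
Also, for the "hash collisions" part, I assume it means distinct keys whose
hash codes are equal, which is easy to get with String keys:

    // Two distinct String keys with the same hash code: "Aa" and "BB"
    // both hash to 2112, so they would land in the same hash group.
    println("Aa".hashCode == "BB".hashCode)   // prints: true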

Thank you.

Hao.
