Which version of Spark are you using?

This bug was fixed in 0.9.2, 1.0.2, and 1.1. Could you upgrade to
one of these versions
to verify it?

Davies

On Tue, Sep 9, 2014 at 7:03 AM, redocpot <julien19890...@gmail.com> wrote:
> Thank you for your replies.
>
> More details here:
>
> The program is executed in local mode (single node). Default environment
> parameters are used.
>
> The test code and the result are in this gist:
> https://gist.github.com/coderh/0147467f0b185462048c
>
> Here are the first 10 lines of the data: 3 fields per row, delimited by ";"
>
> 3801959;11775022;118
> 3801960;14543202;118
> 3801984;11781380;20
> 3801984;13255417;20
> 3802003;11777557;91
> 3802055;11781159;26
> 3802076;11782793;102
> 3802086;17881551;102
> 3802087;19064728;99
> 3802105;12760994;99
> ...
>
> There are 27 partitions (small files). Total size is about 100 MB.
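As a reference for what a correct grouping of these rows should look like, here is a minimal plain-Python sketch (independent of Spark; it just groups by the first field, which is what the non-deterministic groupBy should always produce):

```python
from collections import defaultdict

# A few of the sample rows shown above; fields are ';'-delimited.
sample = """\
3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91"""

# Group rows by the first field, exactly as a deterministic groupBy would.
groups = defaultdict(list)
for line in sample.splitlines():
    key, second, third = line.split(";")
    groups[key].append((second, third))

# Key 3801984 appears twice, so its group must always contain both rows,
# regardless of partitioning or spilling.
print(len(groups["3801984"]))  # → 2
```

If Spark's groupBy returns a different group for 3801984 across runs, the results are non-deterministic, which is the symptom reported here.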
>
> We find that this problem is most likely caused by the bug SPARK-2043:
> https://issues.apache.org/jira/browse/SPARK-2043
>
> Could someone give more details on this bug?
>
> The pull request says:
>
> The current implementation reads one key with the next hash code as it
> finishes reading the keys with the current hash code, which may cause it to
> miss some matches of the next key. This can cause operations like join to
> give the wrong result when reduce tasks spill to disk and there are hash
> collisions, as values won't be matched together. This PR fixes it by not
> reading in that next key, using a peeking iterator instead.
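To make the quoted PR description concrete, here is a toy Python sketch (not Spark's actual Scala code in ExternalAppendOnlyMap; the function and its shape are invented for illustration). It models a merge over (hash, value) pairs sorted by hash code: the pre-fix behaviour consumes the element it reads to detect a group boundary, so the first value of the next hash group is lost; the fixed behaviour "peeks" by carrying that element over:

```python
def hash_groups(stream, peek):
    """Group consecutive (hash, value) pairs by hash code.

    peek=True mimics the fix: the boundary element read to detect the
    end of a group is kept and starts the next group.
    peek=False mimics the pre-fix behaviour: that element is discarded.
    """
    it = iter(stream)
    groups = []
    elem = next(it, None)
    while elem is not None:
        h, bucket = elem[0], [elem[1]]
        nxt = next(it, None)
        while nxt is not None and nxt[0] == h:
            bucket.append(nxt[1])
            nxt = next(it, None)
        groups.append((h, bucket))
        # Fixed: carry the peeked boundary element into the next group.
        # Buggy: drop it and read a fresh element, losing one value.
        elem = nxt if peek else next(it, None)
    return groups

# Hash code 2 holds two values; the buggy variant silently loses "b1".
stream = [(1, "a1"), (1, "a2"), (2, "b1"), (2, "b2")]
print(hash_groups(stream, peek=True))   # → [(1, ['a1', 'a2']), (2, ['b1', 'b2'])]
print(hash_groups(stream, peek=False))  # → [(1, ['a1', 'a2']), (2, ['b2'])]
```

In Spark the lost element corresponds to a key's values from one spilled stream, which is why the wrong results only show up when reduce tasks spill to disk and distinct keys collide on hash codes.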
>
> I don't understand why reading a key with the next hash code would cause it
> to miss some matches of that key. If someone could point me to some code to
> dig into, it would be highly appreciated. =)
>
> Thank you.
>
> Hao.
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13797.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
