Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

Patrick Wendell Wed, 08 Oct 2014 22:14:50 -0700

IIRC, the random number generator is seeded with the partition index, so
it will always produce the same result for the same index. Maybe I don't
totally follow, though. Could you give a small example of how this might
change the RDD ordering in a way that you don't expect? In general,
repartition() will not preserve the ordering of an RDD.
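
To make the question concrete, here is a minimal sketch in plain Scala (no
Spark dependency) of the behaviour being discussed. It is an assumption
about the general shape of the logic, not a copy of Spark's code: the
object RepartitionSketch and the helper distribute are hypothetical names,
and the modulo stands in for hash partitioning on the generated key.

```scala
import scala.util.Random

object RepartitionSketch {
  // Hypothetical stand-in for the per-partition step of a shuffle-based
  // repartition: seed a Random with the input partition index, pick a
  // starting position, then give consecutive positions to the rows in
  // the order they arrive.
  def distribute[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      // position is never negative here, so % yields a valid target partition.
      (position % numPartitions, t)
    }
  }

  // Row -> target partition, so two assignments can be compared per row.
  def targets[T](index: Int, items: Seq[T], numPartitions: Int): Map[T, Int] =
    distribute(index, items, numPartitions).map(_.swap).toMap

  def main(args: Array[String]): Unit = {
    val rows = Seq("a", "b", "c", "d")
    // Same index and same row order: the assignment is reproducible.
    println(targets(0, rows, 5) == targets(0, rows, 5))         // prints "true"
    // Same index and same rows, but a different arrival order:
    // the rows land in different target partitions.
    println(targets(0, rows, 5) == targets(0, rows.reverse, 5)) // prints "false"
  }
}
```

Under these assumptions the seed only fixes the starting position for a
given index. Where each row ends up still depends on the order in which
rows arrive within that input partition.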


On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:
> I noticed that repartition will result in non-deterministic lineage because
> it'll result in changed orders for rows.
>
> So for instance, if you do things like:
>
> val data = read(...)
> val k = data.repartition(5)
> val h = k.repartition(5)
>
> It seems that this results in a different ordering of rows for 'k' each
> time you call it.
> And because of this different ordering, 'h' can even end up with different
> partitions, because 'repartition' distributes rows through a random number
> generator with the 'index' as the key.

