I think the explanation is that a join does not guarantee any ordering,
since it generally causes a shuffle. In your first example the join is
computed twice (once for d1 and once for d2), and the two computations
can order their output differently, which is why the zipped pairs
don't line up.

You can persist() the result of the join, and in practice I believe
you'd find it then behaves as expected, although even that is not 100%
guaranteed, since a cached block could be lost and recomputed (possibly
in a different order).
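
For example, a sketch based on your pipeline below, with persist()
added right after the join so that d1 and d2 both read from the same
materialized blocks instead of recomputing the shuffle (I've simplified
the mapValues/map chain, but it builds the same strings):

raw = sc.parallelize([('k'+str(x), 'v'+str(x)) for x in range(100)])
data = raw.join(raw).map(lambda kv: kv[1][0] + ',' + kv[1][1])
data.persist()   # pin the shuffled result; d1 and d2 now read identical blocks
d1 = data.map(lambda s: s.split(',')[0])
d2 = data.map(lambda s: s.split(',')[1])
x = d1.zip(d2)
print x.take(10)   # pairs should line up, barring block loss and recomputation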

If order matters, and it does for zip(), then the reliable way to
guarantee a well-defined ordering is to sort the RDDs before zipping.
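
For example, again as a sketch: sortByKey() after the join gives a
deterministic total order here because the keys 'k0'..'k99' are unique,
so the ordering survives even if the RDD is recomputed:

raw = sc.parallelize([('k'+str(x), 'v'+str(x)) for x in range(100)])
data = raw.join(raw).sortByKey() \
          .map(lambda kv: kv[1][0] + ',' + kv[1][1])
d1 = data.map(lambda s: s.split(',')[0])
d2 = data.map(lambda s: s.split(',')[1])
x = d1.zip(d2)   # both sides derive from the same sorted RDD, so rows align
print x.take(10)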

On Mon, Mar 23, 2015 at 6:27 PM, Ofer Mendelevitch
<omendelevi...@hortonworks.com> wrote:
> Hi,
>
> I am running into a strange issue in PySpark when doing a JOIN of two RDDs
> followed by a ZIP. It's part of a more complex application, but I was able
> to narrow it down to a simplified example that is easy to replicate and
> shows the same problem:
>
>
> raw = sc.parallelize([('k'+str(x),'v'+str(x)) for x in range(100)])
> data = raw.join(raw).mapValues(lambda x: [x[0]]+[x[1]]) \
>           .map(lambda pair: ','.join([x for x in pair[1]]))
> d1 = data.map(lambda s: s.split(',')[0])
> d2 = data.map(lambda s: s.split(',')[1])
> x = d1.zip(d2)
>
> print x.take(10)
>
>
> The output is:
>
>
> [('v44', 'v80'), ('v79', 'v44'), ('v80', 'v79'), ('v45', 'v78'),
>  ('v81', 'v81'), ('v78', 'v45'), ('v99', 'v99'), ('v82', 'v82'),
>  ('v46', 'v46'), ('v83', 'v83')]
>
>
> As you can see, the pairing is no longer preserved in all cases
> (e.g., 'v81' is still matched with itself, while 'v45' is not).
> Is it not supposed to be preserved?
>
> If I do the same thing without the JOIN:
>
> data = sc.parallelize('v'+str(x)+',v'+str(x) for x in range(100))
> d1 = data.map(lambda s: s.split(',')[0])
> d2 = data.map(lambda s: s.split(',')[1])
> x = d1.zip(d2)
>
> print x.take(10)
>
> The output is:
>
>
> [('v0', 'v0'), ('v1', 'v1'), ('v2', 'v2'), ('v3', 'v3'), ('v4', 'v4'),
> ('v5', 'v5'), ('v6', 'v6'), ('v7', 'v7'), ('v8', 'v8'), ('v9', 'v9')]
>
>
> As expected.
>
> Anyone run into this or a similar issue?
>
> Ofer
