I'm not very comfortable with the idea of generating an RDD from the range (it might take a lot of memory), dispatching it to the nodes, and then zipping.

You should try both approaches and share the performance comparison with us.

Guillaume

I do not know why no one else suggested this. Of course, it has three extra loops (one to count the RDD, one to generate the range, one to zip). Apart from this performance cost, are there any other caveats?
 

I have used something like this in the past.

> val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
> val rddWithIndex = rdd.zip(index)

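If that doesn't work, you could try zipPartitions instead, since its constraints are slightly more relaxed: it needs only the same number of partitions, whereas zip also needs the same number of elements in each partition.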
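
For comparison, here is a minimal sketch of a variant that avoids building the full range on the driver. It is neither zip nor zipPartitions: it counts each partition, turns the counts into start offsets, and numbers the elements with mapPartitionsWithIndex. The names IndexSketch, startOffsets and withIndex are made up for illustration, and the local[2] master is only there so the example runs standalone. It assumes the RDD yields the same elements in the same order on both passes (caching it first would help), so treat it as a starting point rather than a tested recipe.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper object, not part of Spark.
object IndexSketch {

  // Turn per-partition counts into the start index of each partition,
  // i.e. the running sum of the counts of all preceding partitions.
  def startOffsets(counts: Array[Long]): Array[Long] =
    counts.scanLeft(0L)(_ + _).init

  def withIndex[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] = {
    // Pass 1: count each partition; only one Long per partition
    // is sent back to the driver.
    val counts: Array[Long] = rdd
      .mapPartitions(it => Iterator(it.foldLeft(0L)((n, _) => n + 1)))
      .collect()
    val offsets = startOffsets(counts)

    // Pass 2: number the elements locally, starting at the partition's offset.
    // The local position is an Int, so this assumes fewer than Int.MaxValue
    // elements in any single partition.
    rdd.mapPartitionsWithIndex { (pid, it) =>
      it.zipWithIndex.map { case (x, i) => (x, offsets(pid) + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("index-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"), 3)
      withIndex(rdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This still scans the RDD twice (once to count, once to index), but nothing proportional to the number of elements is created on the driver or shipped to the nodes.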


--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
