Re: How to map each line to (line number, line)?

Guillaume Pitel Wed, 01 Jan 2014 04:06:38 -0800

I'm not very comfortable with the idea of generating an RDD from the range (it might take a lot of memory), dispatching it to the nodes, and then zipping.

You should try both approaches and share the performance comparison with us.

Guillaume

Why not use a zipped RDD?
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.ZippedRDD

I do not know why no one else suggested this. Of course, it needs three extra passes (one to count the RDD, one to generate the range, and one to zip). Apart from this performance problem, are there any other caveats?

I have used something like this in the past.

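> // One Long index per element, split into as many partitions as rdd
> val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
> // Pair each element with its index, giving (element, index) tuples
> val rddWithIndex = rdd.zip(index)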

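If that doesn't work, then you could try zipPartitions as well, since it has slightly more relaxed constraints: zip needs the same number of partitions and the same number of elements in each partition, while zipPartitions only needs the same number of partitions.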
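
For comparison, here is a minimal sketch of one way to number lines without building a separate range RDD: count the elements of each partition, turn the counts into starting offsets on the driver, then number each partition locally. The helper name withLineNumbers is hypothetical, and the sketch assumes that the RDD yields its elements in the same order on both passes and that each partition holds fewer than Int.MaxValue elements. It is not the method proposed in this thread, only an illustration of an alternative to compare against.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (the name is an assumption, not part of Spark).
// Returns (line number, line) pairs, numbered from 0 in partition order.
def withLineNumbers(lines: RDD[String]): RDD[(Long, String)] = {
  // Pass 1: count the elements of each partition. Only one
  // (partition id, count) pair per partition is sent to the driver.
  val counts: Array[Long] = lines
    .mapPartitionsWithIndex { (pid, it) =>
      Iterator((pid, it.foldLeft(0L)((n, _) => n + 1L)))
    }
    .collect()
    .sortBy(_._1)
    .map(_._2)

  // offsets(p) is the index of the first element of partition p:
  // the running sum of the counts of all partitions before it.
  val offsets: Array[Long] = counts.scanLeft(0L)(_ + _)

  // Pass 2: number the elements inside each partition, shifted by
  // that partition's starting offset.
  lines.mapPartitionsWithIndex { (pid, it) =>
    it.zipWithIndex.map { case (line, i) => (offsets(pid) + i, line) }
  }
}
```

Because the input is traversed twice, caching it first (lines.cache()) may help when it is expensive to recompute. Unlike the zip-based version, the pairs here come out as (index, element).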

Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
