On Mon, Dec 30, 2013 at 3:46 PM, Tom Vacek <[email protected]> wrote:

> Yes, but a (partitionID, partitionIndex) tuple is a unique identifier
> that's just as useful---and you can map that to unique line numbers at any
> time:
>
>   myRdd.mapPartitionsWithIndex( (id, it) =>
>     it.zipWithIndex.map { case (el, fID) => ((id, fID), el) } )
>

The partition index by itself wouldn't be useful unless we know how many
lines each partition has. For that, we need to count all the lines first
and divide the total by the number of partitions.

Is this the fastest way of doing this?
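
To make the counting step concrete, here is a minimal sketch in plain Scala,
not a definitive implementation. It assumes each inner Vector stands in for
one RDD partition; the object name LineNumbers and its three methods are
hypothetical names chosen for illustration. On a real RDD, the first pass
would presumably be a mapPartitionsWithIndex that emits (id, it.size)
followed by collect, and the second pass another mapPartitionsWithIndex like
the one quoted above.

```scala
// Plain-Scala model of the idea: each inner Vector stands in for one partition.
// All names here are hypothetical and not part of any Spark API.
object LineNumbers {

  // Pass 1: count the elements of each partition. Only the counts are needed,
  // so on a cluster this would collect one number per partition.
  def partitionSizes[A](partitions: Vector[Vector[A]]): Vector[Long] =
    partitions.map(_.size.toLong)

  // A running sum of the sizes gives the global index of the first element
  // of each partition, e.g. sizes (2, 0, 3) become offsets (0, 2, 2).
  def startOffsets(sizes: Vector[Long]): Vector[Long] =
    sizes.scanLeft(0L)(_ + _).init

  // Pass 2: global line number = partition offset + index inside the partition.
  def withLineNumbers[A](partitions: Vector[Vector[A]]): Vector[(Long, A)] = {
    val offsets = startOffsets(partitionSizes(partitions))
    partitions.zipWithIndex.flatMap { case (part, id) =>
      part.zipWithIndex.map { case (el, i) => (offsets(id) + i, el) }
    }
  }
}
```

Note that the sketch uses the actual size of every partition rather than the
total divided by the number of partitions, since partitions are not
guaranteed to hold the same number of lines.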


>
>
> On Mon, Dec 30, 2013 at 8:41 AM, Aureliano Buendia <[email protected]> wrote:
>
>> One thing that could make this more complicated is partitioning.
>>
>>
>> On Mon, Dec 30, 2013 at 12:28 PM, Aureliano Buendia <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> When reading a simple text file in Spark, what's the best way of mapping
>>> each line to (line number, line)? RDD doesn't seem to have an equivalent of
>>> zipWithIndex.
>>>
>>
>>
>
