On Mon, Dec 30, 2013 at 4:06 PM, Guillaume Pitel <[email protected] > wrote:
> Hi,
>
> I have the same problem here, I need to map some values to ids, and I want
> a unique Int. For now I can use a local zipWithIndex, but it won't last for
> long.
>
What do you mean by "it won't last for long"? You can reconstruct the
unique Int precisely from the partition index and the length of each
partition's iterator.
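To illustrate the idea, here is a plain-Scala sketch (no Spark, data made up purely for illustration) that models an RDD as a sequence of partitions and rebuilds each element's global index from its partition index plus the cumulative sizes of the preceding partitions:

```scala
// Model an RDD as a sequence of partitions.
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))

// First pass: learn each partition's length.
val sizes = partitions.map(_.size)
// A cumulative sum yields the first global index of each partition.
val starts = sizes.scanLeft(0)(_ + _)

// Second pass: offset each element's local index by its partition's start.
val indexed = partitions.zipWithIndex.flatMap { case (part, pIdx) =>
  part.zipWithIndex.map { case (elem, i) => (elem, starts(pIdx) + i) }
}
// indexed == Seq(("a",0), ("b",1), ("c",2), ("d",3), ("e",4), ("f",5))
```

In Spark the two passes map onto two `mapPartitionsWithIndex` calls with a `collect` in between, as in the code below.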
>
> The only idea I've found to work around this is to do something like this:
>
> val partitionsSizes = dataset.mapPartitionsWithIndex { case (index, itr) =>
>     Iterator((index, itr.size)) // itr.count needs a predicate; size is what we want
>   }
>   .collect()
>   .sortBy { case (i, v) => i }
>   .map { case (i, v) => v }
> val partitionsStartIndex = partitionsSizes.scanLeft(0)(_ + _) // cumulative sum
> val partitionsInfo =
>   sc.broadcast(partitionsSizes.zip(partitionsStartIndex))
> dataset.mapPartitionsWithIndex { case (index, itr) =>
>   val (size, start) = partitionsInfo.value(index)
>   itr.zip((start until (start + size)).iterator)
> }
>
> Probably not the best solution (it requires 2 maps and a collect), and not
> tested, let me know if it works :)
>
> Guillaume
>
> Hi,
>
> When reading a simple text file in spark, what's the best way of mapping
> each line to (line number, line)? RDD doesn't seem to have an equivalent of
> zipWithIndex.
>
>
>
> --
> *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
