On Mon, Dec 30, 2013 at 4:06 PM, Guillaume Pitel <[email protected] > wrote:
> Hi,
>
> I have the same problem here, I need to map some values to ids, and I want
> a unique Int. For now I can use a local zipWithIndex, but it won't last for
> long.
>
What do you mean by "it won't last for long"? You can reconstruct the
unique Int precisely from the partition index and the length of each
partition's iterator.
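To illustrate the idea, here is a plain-Scala sketch (no Spark, data made up purely for illustration) that models an RDD as a sequence of partitions and rebuilds each element's global index from its partition index plus the cumulative sizes of the preceding partitions:

```scala
// Model an RDD as a sequence of partitions.
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))

// First pass: learn each partition's length.
val sizes = partitions.map(_.size)
// A cumulative sum yields the first global index of each partition.
val starts = sizes.scanLeft(0)(_ + _)

// Second pass: offset each element's local index by its partition's start.
val indexed = partitions.zipWithIndex.flatMap { case (part, pIdx) =>
  part.zipWithIndex.map { case (elem, i) => (elem, starts(pIdx) + i) }
}
// indexed == Seq(("a",0), ("b",1), ("c",2), ("d",3), ("e",4), ("f",5))
```

In Spark the two passes map onto two `mapPartitionsWithIndex` calls with a `collect` in between, as in the code below.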
>
> The only idea I've found to work around this is to do something like this:
>
> val partitionsSizes = dataset.mapPartitionsWithIndex { case (index, itr) =>
>     Iterator((index, itr.size)) // itr.count needs a predicate; size is what we want
>   }
>   .collect()
>   .sortBy { case (i, v) => i }
>   .map { case (i, v) => v }
> val partitionsStartIndex = partitionsSizes.scanLeft(0)(_ + _) // cumulative sum
> val partitionsInfo =
>   sc.broadcast(partitionsSizes.zip(partitionsStartIndex))
> dataset.mapPartitionsWithIndex { case (index, itr) =>
>   val (size, start) = partitionsInfo.value(index)
>   itr.zip((start until (start + size)).iterator)
> }
>
> Probably not the best solution (it requires 2 maps and a collect), and not
> tested, let me know if it works :)
>
> Guillaume
>
> Hi,
>
> When reading a simple text file in spark, what's the best way of mapping
> each line to (line number, line)? RDD doesn't seem to have an equivalent of
> zipWithIndex.
>
>
>
> --
> *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
