On Mon, Dec 30, 2013 at 5:02 PM, Guillaume Pitel <[email protected]> wrote:

>  Hi
>
>
>   I have the same problem here: I need to map some values to ids, and I
>> want a unique Int. For now I can use a local zipWithIndex, but it won't
>> last for long.
>>
>>
>
>  What do you mean by it won't last for long? You can precisely
> reconstruct the unique Int by knowing partition index, and the length of
> each iterator in a partition.
>
>
> It won't last for long = for now my datasets are small enough, but I'll
> have to change that someday
>

How does it depend on the dataset size? Are you saying zipWithIndex is slow
for bigger datasets?

zipWithIndex().map() makes two passes over the data; you can fuse them into
a single for/while loop, as described here:
http://stackoverflow.com/a/9137697/1136722
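
The fusion idea can be sketched in plain Scala (this is a local-collection
sketch under my own naming, not Spark code): instead of zipWithIndex()
building intermediate (value, index) tuples that map() then traverses again,
the index is carried along in one while loop.

```scala
import scala.reflect.ClassTag

// Hypothetical helper: one pass instead of zipWithIndex().map().
// `f` stands in for whatever per-element function you would pass to map().
def fusedZipMap[A, B: ClassTag](xs: Array[A])(f: (A, Int) => B): Array[B] = {
  val out = new Array[B](xs.length)
  var i = 0
  while (i < xs.length) {
    out(i) = f(xs(i), i) // index produced inline, no intermediate tuple array
    i += 1
  }
  out
}

// e.g. fusedZipMap(Array("a", "b", "c"))((v, i) => s"$v-$i")
// is equivalent to Array("a", "b", "c").zipWithIndex.map { case (v, i) => s"$v-$i" }
```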

Unfortunately, functional programming and immutability often pull in the
opposite direction from raw performance, and Spark's choice of Scala does not
always make it easy for us to write the fastest code.


>
> You can precisely reconstruct... => that's what the following code is
> supposed to do
>
>
>
>> The only idea I've found to work around this is to do something like this
>> :
>>
>> val partitionsSizes = dataset.mapPartitionsWithIndex{ case (index, itr)
>> => Iterator( (index, itr.size) ) } // itr.count needs a predicate; size returns the iterator's length
>>     .collect()
>>     .sortBy{ case (i,v) => i }
>>     .map{ case (i,v) => v }
>> val partitionsStartIndex = partitionsSizes.scanLeft(0)(_+_) // cumulative
>> sum
>> val partitionsInfo =
>> sc.broadcast(partitionsSizes.zip(partitionsStartIndex))
>> dataset.mapPartitionsWithIndex{ case (index,itr) => {
>>       val (size, start) = partitionsInfo.value(index)
>>       itr.zip((start until (start + size)).iterator)
>>   }
>> }
>>
>>
>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>

