You're making assumptions about the partitioning.
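
For illustration, a rough sketch of one way to assign indices without assuming
equally sized partitions: count each partition first, turn the counts into
per-partition starting offsets, and only then index. This is only a sketch, not
tested against this thread's data; lines is assumed to be the same RDD as in
the quoted code below, and the names partitionCounts and offsets are made up
for the example.

    // One extra pass over the data: count the elements of each partition.
    val partitionCounts = lines
      .mapPartitionsWithIndex { (pi, it) => Iterator((pi, it.size.toLong)) }
      .collect()
      .sortBy(_._1)
      .map(_._2)

    // offsets(pi) = number of elements in all partitions before partition pi.
    val offsets = partitionCounts.scanLeft(0L)(_ + _)

    // Second pass: index each element starting from its partition's offset.
    val linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
      var i = offsets(pi)
      it.map { line =>
        val pair = (i, line)
        i += 1
        pair
      }
    }

The extra counting job costs one more scan of the data, but it removes both the
equal-partition-size assumption and the need to collect() everything to the
driver.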

On Mon, Dec 30, 2013 at 1:18 PM, Aureliano Buendia <[email protected]> wrote:

>
> On Mon, Dec 30, 2013 at 6:48 PM, Guillaume Pitel <
> [email protected]> wrote:
>
>>
>>
>>
>>> It won't last for long: for now my datasets are small enough,
>>> but I'll have to change it someday
>>>
>>
>>  How does it depend on the dataset size? Are you saying zipWithIndex is
>> slow for bigger datasets?
>>
>>
>> No, but for a dataset of many tens of billions of elements, you cannot fit
>> all the elements in memory on a single host.
>>
>> So at some point the solution I currently use (not the one I sent),
>> dataset.collect().zipWithIndex, just won't scale.
>>
>> Did you try the code I sent? I think the sortBy is probably in the wrong
>> direction, so change it to use -i instead of i.
>>
>
> I'm confused about why we would need in-memory sorting. We just use a loop
> like any other loop in Spark. Why shouldn't this solve the problem?:
>
>     val count = lines.count() // lines is the RDD
>     val partitionLinesCount = count / lines.partitions.length
>     val linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
>       var i = pi * partitionLinesCount
>       it.map { line =>
>         val pair = (i, line)
>         i += 1
>         pair
>       }
>     }
>
>
>>
>> Guillaume
>>
>> --
>>  Guillaume PITEL, Président
>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>
>>  eXenSa S.A.S. <http://www.exensa.com/>
>>  41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>
>
>
