On Mon, Dec 30, 2013 at 6:48 PM, Guillaume Pitel <[email protected]> wrote:
>
>>> It won't last for long: for now my datasets are small enough, but
>>> I'll have to change it someday
>>
>> How does it depend on the dataset size? Are you saying zipWithIndex is
>> slow for bigger datasets?
>
>
> No, but for a dataset of tens of billions of elements, you cannot fit
> all the elements in memory on a single host.
>
> So at some point the solution I currently use (not the one I sent),
> dataset.collect().zipWithIndex, just won't scale.
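>
> Spelled out, it is roughly this (a sketch, assuming the elements are
> strings):
>
> val local: Array[String] = dataset.collect() // every element pulled into the driver's memory
> val indexed = local.zipWithIndex             // indexed locally as Array[(String, Int)]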
>
> Did you try the code I sent? I think the sortBy probably goes in the
> wrong direction, so change it to sort by -i instead of i.
>
I'm confused about why we would need in-memory sorting. We can just use a
loop, like any other loop in Spark. Why shouldn't something like this solve
the problem?
val count = lines.count() // lines is the RDD
// Assumes every partition holds exactly count / numPartitions lines;
// skewed partitions would produce duplicate or skipped indices.
val partitionLinesCount = count / lines.partitions.length
val linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
  var i = pi * partitionLinesCount
  it.map { line =>
    val indexed = (i, line)
    i += 1
    indexed
  }
}
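
If the partitions are not all the same size, a two-pass variant (a sketch,
untested) can compute exact per-partition offsets first:

// Pass 1: count the lines in each partition.
val partitionCounts: Array[Long] = lines
  .mapPartitionsWithIndex { (pi, it) => Iterator((pi, it.size.toLong)) }
  .collect()
  .sortBy(_._1)
  .map(_._2)
// Exclusive prefix sums: offsets(pi) = number of lines before partition pi.
val offsets: Array[Long] = partitionCounts.scanLeft(0L)(_ + _)
// Pass 2: index each line starting from its partition's exact offset.
val linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
  var i = offsets(pi)
  it.map { line =>
    val indexed = (i, line)
    i += 1
    indexed
  }
}

The counting pass replaces lines.count() above, so the total number of jobs
stays the same.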
>
> Guillaume
>
> --
> Guillaume PITEL, Président
> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>
> eXenSa S.A.S. <http://www.exensa.com/>
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
