You're making assumptions about the partitioning.
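To be concrete: the snippet below assumes every partition holds exactly count / lines.partitions.length lines, and Spark makes no such guarantee, so the generated indices can collide or leave gaps. One way around that (illustrative sketch only, the names are mine and not from this thread) is to count each partition first, turn those counts into per-partition starting offsets, and only then hand out indices inside each partition. Something like:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Illustrative sketch: the first pass counts elements per partition (only
    // one Long per partition is collected to the driver, never the data),
    // the second pass assigns indices starting from each partition's offset.
    def withLineNumbers[T: ClassTag](rdd: RDD[T]): RDD[(Long, T)] = {
      val counts: Array[Long] =
        rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()
      // offsets(p) = number of elements in all partitions before partition p
      val offsets: Array[Long] = counts.scanLeft(0L)(_ + _)

      rdd.mapPartitionsWithIndex { (pi, it) =>
        var i = offsets(pi)
        it.map { elem =>
          val indexed = (i, elem)
          i += 1
          indexed
        }
      }
    }

The price is two passes over the data instead of one, but nothing beyond the per-partition counts ever leaves the executors. Newer Spark versions expose essentially this pattern as RDD.zipWithIndex, with the pair the other way around (element, index).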
On Mon, Dec 30, 2013 at 1:18 PM, Aureliano Buendia <[email protected]> wrote:

> On Mon, Dec 30, 2013 at 6:48 PM, Guillaume Pitel <[email protected]> wrote:
>
>>>> It won't last for long = for now my datasets are small enough, but I'll
>>>> have to change it someday.
>>>
>>> How does it depend on the dataset size? Are you saying zipWithIndex is
>>> slow for bigger datasets?
>>
>> No, but for a dataset of multiple tens of billions of elements, you cannot
>> fit your elements in memory on a single host.
>>
>> So at some point, the solution I currently use (not the one I sent),
>> dataset.collect().zipWithIndex, just won't scale.
>>
>> Did you try the code I sent? I think the sortBy is probably in the wrong
>> direction, so change it with -i instead of i.
>
> I'm confused about why we would need in-memory sorting. We just use a loop
> like any other loop in Spark. Why shouldn't this solve the problem?
>
>     val count = lines.count() // lines is the RDD
>     val partitionLinesCount = count / lines.partitions.length
>     val linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
>       var i = pi * partitionLinesCount
>       it.map { line =>
>         val indexed = (i, line)
>         i += 1
>         indexed
>       }
>     }
>
>> Guillaume
>>
>> Guillaume PITEL, Président
>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>
>> eXenSa S.A.S. <http://www.exensa.com/>
>> 41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
