On Mon, Dec 30, 2013 at 5:02 PM, Guillaume Pitel <[email protected] > wrote:
> Hi
>
>> I have the same problem here, I need to map some values to ids, and I
>> want a unique Int. For now I can use a local zipWithIndex, but it won't
>> last for long.
>
> What do you mean by "it won't last for long"? You can precisely
> reconstruct the unique Int by knowing the partition index and the length
> of each iterator in a partition.
>
> It won't last for long = for now my datasets are small enough, but I'll
> have to change it someday.

How does it depend on the dataset size? Are you saying zipWithIndex is slow
for bigger datasets? zipWithIndex().map() is two loops; you should use a
simple for/while loop to make it one loop, as described here:
http://stackoverflow.com/a/9137697/1136722
Unfortunately, functional programming and immutability do not go in the same
direction as performance, and Spark's choice of Scala does not help us write
faster code.

> You can precisely reconstruct... => that's what the following code is
> supposed to do:
>
>> val partitionsSizes = dataset.mapPartitionsWithIndex { case (index, itr) =>
>>     List((index, itr.size)).iterator // itr.size, not itr.count: Iterator.count needs a predicate
>>   }
>>   .collect()
>>   .sortBy { case (i, v) => i }
>>   .map { case (i, v) => v }
>> val partitionsStartIndex = partitionsSizes.scanLeft(0)(_ + _) // cumulative sum
>> val partitionsInfo = sc.broadcast(partitionsSizes.zip(partitionsStartIndex))
>> dataset.mapPartitionsWithIndex { case (index, itr) =>
>>   val partitionInfo = partitionsInfo.value(index)
>>   itr.zip((partitionInfo._2 until (partitionInfo._2 + partitionInfo._1)).iterator)
>> }

--
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
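[Editor's note: the index-reconstruction idea quoted above (partition start offsets from a cumulative sum of partition sizes) can be sketched locally without a Spark cluster. The sketch below models a dataset as a `Seq` of partitions; the object and method names are illustrative, not part of the Spark API.]

```scala
// Minimal local sketch of the quoted technique: compute each partition's
// start offset as the cumulative sum of partition sizes, then give every
// element a unique global index (start offset + position within partition).
object GlobalIndexSketch {
  def zipWithGlobalIndex[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Int)]] = {
    // scanLeft(0)(_ + _) yields the start index of each partition,
    // exactly as partitionsStartIndex does in the quoted snippet.
    val starts = partitions.map(_.size).scanLeft(0)(_ + _)
    partitions.zip(starts).map { case (part, start) =>
      part.zipWithIndex.map { case (elem, i) => (elem, start + i) }
    }
  }
}
```

For example, `GlobalIndexSketch.zipWithGlobalIndex(Seq(Seq("a", "b"), Seq("c")))` yields `Seq(Seq(("a", 0), ("b", 1)), Seq(("c", 2)))`: indices are unique and contiguous across partitions, which is what the broadcast-based Spark version achieves distributively.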
