Hi

I have the same problem here: I need to map some values to ids, and I want a unique Int for each value. For now I can use a local zipWithIndex, but that won't last for long.

What do you mean by "it won't last for long"? You can precisely reconstruct the unique Int by knowing the partition index and the length of each partition's iterator.
 
"It won't last for long" = for now my datasets are small enough, but I'll have to change it someday.

"You can precisely reconstruct..." => that's what the following code is supposed to do.


The only idea I've found to work around this is to do something like this:

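// Count the elements of each partition (Iterator has size, not a no-arg count),
// then order the counts by partition index.
val partitionsSizes = dataset.mapPartitionsWithIndex{ case (index, itr) => Iterator( (index, itr.size) ) }
    .collect()
    .sortBy{ case (i,v) => i }
    .map{ case (i,v) => v }
// Start index of each partition: cumulative sum of the sizes before it.
// scanLeft yields one extra trailing element (the total), which zip drops.
val partitionsStartIndex = partitionsSizes.scanLeft(0)(_+_)
val partitionsInfo = sc.broadcast(partitionsSizes.zip(partitionsStartIndex))
// Second pass: pair each element with start index + position in its partition.
// dataset is traversed twice, so it must yield the same partitions both times.
dataset.mapPartitionsWithIndex{ case (index,itr) =>
      val (size, start) = partitionsInfo.value(index)
      itr.zip((start until (start + size)).iterator)
}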
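
To make the index arithmetic easier to check, here is a minimal sketch of the same idea without Spark. Plain Scala Seqs stand in for the partitions, and the names OffsetSketch and globalIndices are made up for this illustration; it only shows how the start offsets are derived, not how it would behave on a cluster.

```scala
// Sketch only: plain Scala collections stand in for RDD partitions (an assumption).
object OffsetSketch {
  // Pairs every element with a globally unique index, given the partitions in order.
  def globalIndices[A](partitions: Seq[Seq[A]]): Seq[Seq[(A, Int)]] = {
    val sizes = partitions.map(_.size)
    // Start index of partition i = total size of partitions 0 until i.
    // scanLeft yields one extra trailing element (the grand total); zip drops it.
    val starts = sizes.scanLeft(0)(_ + _)
    partitions.zip(starts).map { case (part, start) =>
      part.zipWithIndex.map { case (value, local) => (value, start + local) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Three partitions of sizes 2, 0 and 3; an empty partition is handled too.
    val partitions = Seq(Seq("a", "b"), Seq.empty[String], Seq("c", "d", "e"))
    globalIndices(partitions).foreach(println)
  }
}
```

With partitions of sizes 2, 0 and 3 the start offsets are 0, 2 and 2, so the five elements get the indices 0 to 4 with no gaps and no duplicates.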

--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
