Hi,

I have the same problem here: I need to map some values to IDs, and I want each ID to be a unique Int. For now I can collect the data locally and use zipWithIndex, but that won't scale for long.

The only workaround I've found is to do something like this:

// First pass: compute the size of each partition
// (Iterator.count needs a predicate in Scala, so use .size here)
val partitionSizes = dataset
  .mapPartitionsWithIndex { (index, itr) => Iterator((index, itr.size)) }
  .collect()
  .sortBy { case (index, _) => index }
  .map { case (_, size) => size }
// Cumulative sum gives the global starting index of each partition
val partitionStartIndices = partitionSizes.scanLeft(0)(_ + _)
val partitionInfo = sc.broadcast(partitionSizes.zip(partitionStartIndices))
// Second pass: zip each element with its global index
dataset.mapPartitionsWithIndex { (index, itr) =>
  val (size, start) = partitionInfo.value(index)
  itr.zip((start until (start + size)).iterator)
}

Probably not the best solution (it requires two passes over the data plus a collect), and it's untested, so let me know if it works :)
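The offset arithmetic itself can be checked locally without Spark. A minimal sketch with made-up sample partitions, mirroring the two passes above (per-partition sizes, then scanLeft for starting offsets, then zipping each element with its global index):

```scala
// Simulate a dataset already split into partitions (hypothetical sample data)
val partitions = Seq(Seq("a", "b", "c"), Seq("d"), Seq("e", "f"))

// Per-partition sizes, as the first mapPartitionsWithIndex pass would compute
val sizes = partitions.map(_.size)            // Seq(3, 1, 2)

// Cumulative sum: the global starting index of each partition
val starts = sizes.scanLeft(0)(_ + _)         // Seq(0, 3, 4, 6)

// Zip each element with its global index, mirroring the second pass
val indexed = partitions.zip(starts).flatMap { case (part, start) =>
  part.zip(start until (start + part.size))
}

println(indexed)  // List((a,0), (b,1), (c,2), (d,3), (e,4), (f,5))
```

Note that scanLeft produces one extra trailing element (the total count), which zip simply truncates away.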

Guillaume

Hi,

When reading a simple text file in Spark, what's the best way to map each line to a (line number, line) pair? RDD doesn't seem to have an equivalent of zipWithIndex.


--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
