Re: How to map each line to (line number, line)?

Guillaume Pitel Mon, 30 Dec 2013 12:02:28 -0800

You're assuming each partition has the same line count. I don't think that's true (actually, I'm almost certain it's false). And anyway, your code also requires two maps.
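To make the objection concrete, here is a small plain-Scala sketch (no Spark involved) that applies the same arithmetic as the snippet quoted further down to partitions of unequal size. The object name, the helper `naiveIndices` and the sample data are all made up for illustration.

```scala
object UnevenPartitions {
  // Hypothetical data: three "partitions" holding 1, 3 and 2 lines.
  val partitions: Seq[Seq[String]] =
    Seq(Seq("a"), Seq("b", "c", "d"), Seq("e", "f"))

  // Same arithmetic as the quoted snippet:
  // start of a partition = partition index * (total count / number of partitions).
  def naiveIndices(parts: Seq[Seq[String]]): Seq[Long] = {
    val perPartition = parts.map(_.size).sum.toLong / parts.length
    parts.zipWithIndex.flatMap { case (p, pi) =>
      p.indices.map(j => pi * perPartition + j)
    }
  }
}

// UnevenPartitions.naiveIndices(UnevenPartitions.partitions)
// yields 0, 2, 3, 4, 4, 5: index 1 is never assigned and index 4 is assigned twice.
```

With 6 lines in 3 partitions the assumed size is 2 per partition, so the second partition starts at 2 and the third at 4, regardless of how many lines each one really holds.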

In my code, the sorting and the other operations are performed on a very small dataset: one element per partition.
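For readers who don't have the earlier message at hand, the idea can be sketched roughly as follows. This is not the code from that message, only a minimal reconstruction of the approach described here: count the lines in each partition, sort that tiny list of counts on the driver, turn it into starting offsets, then number the lines in a second pass. The names `LineNumbers`, `partitionOffsets` and `withLineNumbers` are hypothetical, and the sketch assumes the RDD yields the same partitions in the same order on both passes (as when reading a text file).

```scala
import org.apache.spark.rdd.RDD

object LineNumbers {
  // Turn (partition index, line count) pairs into the starting
  // line number of each partition. This is the "very small dataset":
  // one element per partition, handled on the driver.
  def partitionOffsets(counts: Seq[(Int, Long)]): Map[Int, Long] = {
    val sorted = counts.sortBy(_._1)
    // Running total of the counts; zip below drops the final grand total.
    val starts = sorted.map(_._2).scanLeft(0L)(_ + _)
    sorted.map(_._1).zip(starts).toMap
  }

  def withLineNumbers(lines: RDD[String]): RDD[(Long, String)] = {
    // Pass 1: emit a single (partition index, line count) pair per partition.
    val counts = lines
      .mapPartitionsWithIndex { (pi, it) => Iterator((pi, it.size.toLong)) }
      .collect()
      .toSeq
    val offsets = partitionOffsets(counts)

    // Pass 2: number the lines, starting from the partition's real offset.
    lines.mapPartitionsWithIndex { (pi, it) =>
      val start = offsets(pi)
      it.zipWithIndex.map { case (line, i) => (start + i, line) }
    }
  }
}
```

Since the RDD is traversed twice, it is probably worth caching `lines` first if it is expensive to recompute.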

Guillaume

Did you try the code I sent? I think the sortBy is probably sorting in the wrong direction, so change it to use -i instead of i.

I'm confused about why we would need in-memory sorting. We just use a loop like any other loop in Spark. Why wouldn't this solve the problem?

    val count = lines.count() // lines is the RDD
    // Assumes every partition holds the same number of lines
    val partitionLinesCount = count / lines.partitions.length
    val linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
      var i = pi * partitionLinesCount
      it.map { line =>
        val indexed = (i, line)
        i += 1
        indexed
      }
    }

