On Mon, Dec 30, 2013 at 6:48 PM, Guillaume Pitel <[email protected]
> wrote:

>
>
>
>         It won't last for long = for now my dataset are small enough, but
>> I'll have to change it someday
>>
>
>  How does it depend on the dataset size? Are you saying zipWithIndex is
> slow for bigger datasets?
>
>
> No, but for a multi-tens of billion elements dataset, you cannot fit your
> elements in memory on a single host.
>
> So at some point, the solution I currently use (not the one I sent) :
> dataset.collect().zipWithIndex just won't scale.
>
> Did you try the code I sent ? I think the sortBy is probably in the wrong
> direction, so change it with -i instead of i
>

I'm confused why would need in memory sorting. We just use a loop like any
other loops in spark. Why shouldn't this solve the problem?:

    val count = lines.count() // lines is the rdd
    val partitionLinesCount = count / rdd.partitions.length
    linesWithIndex = lines.mapPartitionsWithIndex { (pi, it) =>
      var i = pi * partitionLinesCount
      it.map {
        *line => (i, line)*
        i += 1
      }
    }


>
> Guillaume
>
> --
>    [image: eXenSa]
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>
>  eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>

<<exensa_logo_mail.png>>

Reply via email to