(Indeed, though the OP said it was a requirement that the pairs are
drawn from the same partition.)

On Thu, Apr 16, 2015 at 11:14 PM, Guillaume Pitel
<guillaume.pi...@exensa.com> wrote:
> Hi Aurelien,
>
> Sean's solution is nice, but maybe not completely order-free, since pairs
> will come from the same partition.
>
> The easiest / fastest way to do it, in my opinion, is to use a random key
> instead of zipWithIndex. Of course you won't be able to ensure uniqueness
> of each element of the pairs, but maybe you don't care since you're
> sampling with replacement already?
>
> val a = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
> val b = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
>
> k should be roughly the number of elements you're sampling. Key collisions
> will skew the distribution somewhat, but I don't think it should hurt too much.
>
> Guillaume
>
> Hi everyone,
> However, I am not happy with this solution because each element is most
> likely to be paired with elements that are "close by" in the partition. This
> is because sample returns an "ordered" Iterator.
>
>
>
> --
> Guillaume PITEL, Président
> +33(0)626 222 431
>
> eXenSa S.A.S.
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
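
For reference, the random-key idea quoted above could be sketched roughly as
follows. This is only an illustration under assumptions, not code from the
thread: the helper name randomPairs and its fraction parameter are
hypothetical, and it assumes the two keyed RDDs are then joined on the key,
which the message implies but does not spell out. It also assumes Spark 1.3
or later, where the pair-RDD implicits are in scope without extra imports.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Hypothetical helper: pair up two independent samples of `rdd`
// by tagging each element with a random key and joining on that key.
def randomPairs[T: ClassTag](rdd: RDD[T], fraction: Double, k: Int): RDD[(T, T)] = {
  // Random.nextInt(k) returns a key in [0, k), so it is never negative.
  val a = rdd.sample(withReplacement = true, fraction)
    .map(x => (Random.nextInt(k), x))
  val b = rdd.sample(withReplacement = true, fraction)
    .map(x => (Random.nextInt(k), x))

  // The join shuffles by key, so partners are no longer tied to the
  // source partition. Every (a, b) combination sharing a key is emitted,
  // so colliding keys produce extra pairs and unmatched keys produce none.
  a.join(b).values
}
```

With k close to the expected sample size, most keys match zero, one, or a
handful of elements on each side, which is the skew from collisions mentioned
in the quoted message. Note that the random keys are regenerated if `a` or
`b` is recomputed, so the result should be cached if it is used more than once.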
