(Indeed, though the OP said it was a requirement that the pairs are drawn from the same partition.)
On Thu, Apr 16, 2015 at 11:14 PM, Guillaume Pitel <guillaume.pi...@exensa.com> wrote:
> Hi Aurelien,
>
> Sean's solution is nice, but maybe not completely order-free, since pairs
> will come from the same partition.
>
> The easiest / fastest way to do it in my opinion is to use a random key
> instead of a zipWithIndex. Of course you won't be able to ensure uniqueness
> of each element of the pairs, but maybe you don't care since you're
> sampling with replacement already?
>
> val a = rdd.sample(...).map{ x => (rand() % k, x)}
> val b = rdd.sample(...).map{ x => (rand() % k, x)}
>
> k must be ~ the number of elements you're sampling. You'll have a skewed
> distribution due to collisions, but I don't think it should hurt too much.
>
> Guillaume
>
> Hi everyone,
> However I am not happy with this solution because each element is most
> likely to be paired with elements that are "close by" in the partition. This
> is because sample returns an "ordered" Iterator.
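
For illustration, the random-key idea quoted above can be simulated with plain Scala collections standing in for the two sampled RDDs. This is only a sketch under assumptions: `withRandomKeys`, `joinOnKey`, and the sample data are hypothetical names, `rng.nextInt(k)` is swapped in for the `rand() % k` pseudocode, and pairing the two keyed samples with an inner join on the key is an assumption, since the quoted message does not spell that step out.

```scala
import scala.util.Random

object RandomKeyPairing {
  // Tag every element with a random key in [0, k), mirroring
  // map { x => (rand() % k, x) } from the quoted message.
  def withRandomKeys[T](xs: Seq[T], k: Int, rng: Random): Seq[(Int, T)] =
    xs.map(x => (rng.nextInt(k), x))

  // Local stand-in for an inner join on the key (assumed pairing step):
  // for each key present on both sides, emit every (left, right) combination.
  def joinOnKey[T](a: Seq[(Int, T)], b: Seq[(Int, T)]): Seq[(T, T)] = {
    val rightByKey: Map[Int, Seq[T]] =
      b.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }
    for {
      (key, left) <- a
      right <- rightByKey.getOrElse(key, Seq.empty[T])
    } yield (left, right)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)           // fixed seed, only for repeatability
    val sampleA = (1 to 10).toSeq       // hypothetical first sample
    val sampleB = (101 to 110).toSeq    // hypothetical second sample
    val k = 10                          // roughly the number of sampled elements
    val pairs = joinOnKey(
      withRandomKeys(sampleA, k, rng),
      withRandomKeys(sampleB, k, rng))
    pairs.foreach(println)
    println(s"${pairs.size} pairs from ${sampleA.size} x ${sampleB.size} sampled elements")
  }
}
```

With `k = 1` every element gets the same key, so the join degenerates to the full cross product. With `k` close to the sample size, some elements land on a key that has no partner and are dropped, while others are paired more than once; that is the skew from collisions mentioned in the quoted message.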