If you can store the entire sample for one partition in memory, I think you just want:
val sample1 = rdd.sample(true, 0.01, 42).mapPartitions(it => scala.util.Random.shuffle(it))
val sample2 = rdd.sample(true, 0.01, 43).mapPartitions(it => scala.util.Random.shuffle(it))
...

On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet
<aurelien.bel...@telecom-paristech.fr> wrote:
> Hi Sean,
>
> Thanks a lot for your reply. The problem is that I need to sample random
> *independent* pairs. If I draw two samples and build all n*(n-1) pairs, then
> there is a lot of dependency. My current solution is not satisfying either,
> because some pairs (the closest ones within a partition) have a much higher
> probability of being sampled. Not sure how to fix this.
>
> Aurelien
>
> On 16/04/2015 20:44, Sean Owen wrote:
>> Use mapPartitions, and then take two random samples of the elements in
>> the partition, and return an iterator over all pairs of them? Should
>> be pretty simple, assuming your sample size n is smallish, since you're
>> returning ~n^2 pairs.
>>
>> On Thu, Apr 16, 2015 at 7:00 PM, abellet
>> <aurelien.bel...@telecom-paristech.fr> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a large RDD and I am trying to create an RDD of random pairs of
>>> elements sampled from this RDD. The elements composing a pair should
>>> come from the same partition, for efficiency. The idea I've come up with
>>> is to take two random samples and then use zipPartitions to pair each
>>> i-th element of the first sample with the i-th element of the second
>>> sample.
>>> Here is some sample code illustrating the idea:
>>>
>>> -----------
>>> val rdd = sc.parallelize(1 to 60000, 16)
>>>
>>> val sample1 = rdd.sample(true, 0.01, 42)
>>> val sample2 = rdd.sample(true, 0.01, 43)
>>>
>>> def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] = {
>>>   var res = List[String]()
>>>   while (s1.hasNext && s2.hasNext) {
>>>     val x = s1.next + " " + s2.next
>>>     res ::= x
>>>   }
>>>   res.iterator
>>> }
>>>
>>> val pairs = sample1.zipPartitions(sample2)(myfunc)
>>> -------------
>>>
>>> However, I am not happy with this solution, because each element is most
>>> likely to be paired with elements that are "close by" in the partition.
>>> This is because sample returns an "ordered" iterator.
>>>
>>> Any idea how to fix this? I have not found a way to efficiently shuffle
>>> the random sample so far.
>>>
>>> Thanks a lot!
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Random-pairs-RDD-order-tp22529.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
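The exchange above can be reproduced without a cluster. The sketch below is a Spark-free simulation of one partition (partition contents, the 0.01 fraction, and the seeds 42/43 are taken from the snippets in the thread; a simple without-replacement Bernoulli filter stands in for `rdd.sample` for clarity, and the shuffle seeds are arbitrary). It shows both the positional bias the original post describes — each sample preserves partition order — and how shuffling each sample before pairing i-th with i-th, which is what the `mapPartitions(Random.shuffle)` suggestion achieves per partition, breaks that coupling:

```scala
import scala.util.Random

// One partition's worth of data, as in sc.parallelize(1 to 60000, 16).
val partition: Vector[Int] = (1 to 3750).toVector

// Bernoulli sampling at a fixed fraction and seed, standing in for what
// rdd.sample does within a single partition.
def sample(data: Vector[Int], fraction: Double, seed: Long): Vector[Int] = {
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

val s1 = sample(partition, 0.01, 42L)
val s2 = sample(partition, 0.01, 43L)

// The reported problem: sampling preserves partition order, so the i-th
// elements of s1 and s2 sit at similar depths in the partition, and
// "close by" elements are systematically paired together.
val orderedBias = (s1 == s1.sorted) && (s2 == s2.sorted)

// The proposed fix: shuffle each sample independently, then pair i-th with
// i-th. Positions are now uniformly random, so pair members are no longer
// preferentially close in partition order.
val pairs: Vector[(Int, Int)] =
  new Random(1L).shuffle(s1).zip(new Random(2L).shuffle(s2))

println(s"sampled ${s1.length} and ${s2.length} elements, built ${pairs.length} pairs")
```

Note the trade-off this thread settles on: `Random.shuffle` materializes a whole partition's sample in memory, which is why the suggestion is prefixed with "if you can store the entire sample for one partition in memory".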