If you can store the entire sample for one partition in memory, I think you
just want:

val sample1 = rdd.sample(true, 0.01, 42)
  .mapPartitions(it => scala.util.Random.shuffle(it.toBuffer).iterator)
val sample2 = rdd.sample(true, 0.01, 43)
  .mapPartitions(it => scala.util.Random.shuffle(it.toBuffer).iterator)
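
Then, to pair them up per partition, something like this should work with
your zipPartitions idea (untested sketch; note that zipPartitions will drop
unmatched leftover elements on the longer side of each partition):

val pairs = sample1.zipPartitions(sample2) { (s1, s2) =>
  // pair the i-th elements of the two shuffled samples
  s1.zip(s2).map { case (a, b) => a + " " + b }
}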

...



On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet <
aurelien.bel...@telecom-paristech.fr> wrote:

> Hi Sean,
>
> Thanks a lot for your reply. The problem is that I need to sample random
> *independent* pairs. If I draw two samples and build all n*(n-1) pairs then
> there is a lot of dependency. My current solution is also not satisfactory
> because some pairs (the closest ones within a partition) have a much higher
> probability of being sampled. I am not sure how to fix this.
>
> Aurelien
>
>
> On 16/04/2015 20:44, Sean Owen wrote:
>
>> Use mapPartitions, and then take two random samples of the elements in
>> the partition, and return an iterator over all pairs of them? Should
>> be pretty simple assuming your sample size n is smallish since you're
>> returning ~n^2 pairs.
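>>
>> A rough, untested sketch of what I mean (sampleSize and the RNG seed are
>> placeholders):
>>
>> val sampleSize = 10
>> val pairs = rdd.mapPartitions { it =>
>>   val elems = it.toVector
>>   if (elems.isEmpty) Iterator.empty
>>   else {
>>     val rng = new scala.util.Random(42)
>>     // two small samples drawn with replacement from this partition
>>     val s1 = Vector.fill(sampleSize)(elems(rng.nextInt(elems.size)))
>>     val s2 = Vector.fill(sampleSize)(elems(rng.nextInt(elems.size)))
>>     // return all ~sampleSize^2 cross pairs
>>     (for (a <- s1; b <- s2) yield (a, b)).iterator
>>   }
>> }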
>>
>> On Thu, Apr 16, 2015 at 7:00 PM, abellet
>> <aurelien.bel...@telecom-paristech.fr> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a large RDD and I am trying to create an RDD of a random sample of
>>> pairs of elements from this RDD. The elements composing a pair should
>>> come
>>> from the same partition for efficiency. The idea I've come up with is to
>>> take two random samples and then use zipPartitions to pair the i-th element
>>> of the first sample with the i-th element of the second sample. Here is
>>> some sample code illustrating the idea:
>>>
>>> -----------
>>> val rdd = sc.parallelize(1 to 60000, 16)
>>>
>>> val sample1 = rdd.sample(true,0.01,42)
>>> val sample2 = rdd.sample(true,0.01,43)
>>>
>>> def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] =
>>> {
>>>    var res = List[String]()
>>>    while (s1.hasNext && s2.hasNext)
>>>    {
>>>      val x = s1.next + " " + s2.next
>>>      res ::= x
>>>    }
>>>    res.iterator
>>> }
>>>
>>> val pairs = sample1.zipPartitions(sample2)(myfunc)
>>> -------------
>>>
>>> However, I am not happy with this solution because each element is most
>>> likely to be paired with elements that are close by in the partition. This
>>> is because sample returns an "ordered" Iterator.
>>>
>>> Any idea how to fix this? I did not find a way to efficiently shuffle the
>>> random sample so far.
>>>
>>> Thanks a lot!
>>>
