Re: Random pairs / RDD order

2015-04-19 Thread Aurélien Bellet

Hi Imran,

Thanks for the suggestion! Unfortunately the types do not match, but I 
could write my own function that shuffles the sample.
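For what it's worth, a version with matching types could look like the sketch below. This is my own illustration, not code from the thread: `mapPartitions` wants an `Iterator[T] => Iterator[U]`, while `Random.shuffle` consumes a collection and returns one, so the partition has to be materialized first (which assumes one partition's sample fits in memory):

```scala
import scala.util.Random

// mapPartitions expects a function Iterator[T] => Iterator[U].
// Random.shuffle needs a materialized collection, so we buffer the
// partition, shuffle it, and hand back an iterator.
def shufflePartition[T](it: Iterator[T]): Iterator[T] =
  Random.shuffle(it.toVector).iterator

// With Spark this would plug in as, e.g.:
//   val sample1 = rdd.sample(true, 0.01, 42).mapPartitions(shufflePartition)
```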


On 4/17/15 9:34 PM, Imran Rashid wrote:

if you can store the entire sample for one partition in memory, I think
you just want:

val sample1 = rdd.sample(true, 0.01, 42).mapPartitions(scala.util.Random.shuffle)
val sample2 = rdd.sample(true, 0.01, 43).mapPartitions(scala.util.Random.shuffle)

...





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Random pairs / RDD order

2015-04-17 Thread Aurélien Bellet

Hi Sean,

Thanks a lot for your reply. The problem is that I need to sample random 
*independent* pairs. If I draw two samples and build all n*(n-1) pairs, 
then there is a lot of dependency. My current solution is also not 
satisfactory, because some pairs (the closest ones in a partition) have a 
much higher probability of being sampled. Not sure how to fix this.


Aurelien

On 16/04/2015 20:44, Sean Owen wrote:

Use mapPartitions, and then take two random samples of the elements in
the partition, and return an iterator over all pairs of them? Should
be pretty simple assuming your sample size n is smallish since you're
returning ~n^2 pairs.




Random pairs / RDD order

2015-04-16 Thread abellet
Hi everyone,

I have a large RDD and I am trying to create an RDD of random pairs of
elements sampled from it. The elements composing a pair should come from
the same partition, for efficiency. The idea I've come up with is to take
two random samples and then use zipPartitions to pair the i-th element of
the first sample with the i-th element of the second. Here is sample code
illustrating the idea:

---
val rdd = sc.parallelize(1 to 6, 16)

val sample1 = rdd.sample(true,0.01,42)
val sample2 = rdd.sample(true,0.01,43)

def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (s1.hasNext && s2.hasNext)
  {
    val x = s1.next + " " + s2.next
    res ::= x
  }
  res.iterator
}

val pairs = sample1.zipPartitions(sample2)(myfunc)
-
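To see what the per-partition function does, the same pairing logic can be run on plain iterators, outside Spark (a standalone sketch of my own, not from the thread):

```scala
// The zipPartitions body, runnable on ordinary iterators: it walks
// both iterators in lockstep and pairs up the i-th elements.
def pairUp(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] = {
  var res = List[String]()
  while (s1.hasNext && s2.hasNext) {
    res ::= s1.next + " " + s2.next
  }
  res.iterator
}

// pairs == List("3 6", "2 5", "1 4") -- lockstep pairs, newest first
// since ::= prepends to the list.
val pairs = pairUp(Iterator(1, 2, 3), Iterator(4, 5, 6)).toList
```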

However I am not happy with this solution, because each element is most
likely to be paired with elements that are close by in the partition. This
is because sample returns the elements in their original order.

Any idea how to fix this? I did not find a way to efficiently shuffle the
random sample so far.

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Random-pairs-RDD-order-tp22529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Random pairs / RDD order

2015-04-16 Thread Guillaume Pitel

Hi Aurelien,

Sean's solution is nice, but maybe not completely order-free, since 
pairs will come from the same partition.


The easiest / fastest way to do it, in my opinion, is to use a random key 
instead of zipWithIndex. Of course you won't be able to ensure the 
uniqueness of the elements within each pair, but maybe you don't care, 
since you're already sampling with replacement?


val a = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }
val b = rdd.sample(...).map { x => (scala.util.Random.nextInt(k), x) }

k must be ~ the number of elements you're sampling. You'll have a 
skewed distribution due to collisions, but I don't think it should hurt 
too much.
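On plain collections, the idea can be sketched like this (a standalone illustration of my own; `k`, the seed, and the sample values are made up, and in Spark the grouping step would be a join on the random key):

```scala
import scala.util.Random

val rng = new Random(42)
val k = 4  // roughly the number of elements sampled

// Key each sampled element with a uniform random key in [0, k)
val a = List(10, 20, 30, 40).map(x => (rng.nextInt(k), x))
val b = List(1, 2, 3, 4).map(x => (rng.nextInt(k), x))

// Joining on the key pairs up elements that drew the same key;
// collisions make some keys match several elements (the skew
// mentioned above).
val byKey = b.groupBy(_._1)
val paired = for {
  (key, x) <- a
  (_, y)   <- byKey.getOrElse(key, Nil)
} yield (x, y)
```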


Guillaume





--
Guillaume PITEL, Président
+33(0)626 222 431

eXenSa S.A.S. - http://www.exensa.com/
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
(Indeed, though the OP said it was a requirement that the pairs are
drawn from the same partition.)




Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
Use mapPartitions, and then take two random samples of the elements in
the partition, and return an iterator over all pairs of them? Should
be pretty simple assuming your sample size n is smallish since you're
returning ~n^2 pairs.
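The per-partition body of this suggestion might look like the following sketch (my illustration; `n` and the RNG are arbitrary). It materializes the partition, draws two samples of size n with replacement, and emits all n*n cross pairs:

```scala
import scala.util.Random

// Draw two random samples (with replacement) from one partition's
// elements and return all cross pairs; intended as a mapPartitions body.
def crossPairs[T](partition: Iterator[T], n: Int, rng: Random): Iterator[(T, T)] = {
  val elems = partition.toVector
  if (elems.isEmpty) Iterator.empty
  else {
    val s1 = Vector.fill(n)(elems(rng.nextInt(elems.size)))
    val s2 = Vector.fill(n)(elems(rng.nextInt(elems.size)))
    (for (x <- s1; y <- s2) yield (x, y)).iterator
  }
}

// With Spark: rdd.mapPartitions(it => crossPairs(it, n, new Random()))
```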
