Re: How to join two PairRDD together?

2014-08-28 Thread Yanbo Liang
Maybe you can refer to the sliding method of RDD, although right now it is
a private method in mllib.
Look at org.apache.spark.mllib.rdd.RDDFunctions.


2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com:

 Can you paste the code?  It's unclear to me how/when the out of memory is
 occurring without seeing the code.




 On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote:

 Hello everyone,
 I am porting a clustering algorithm to the Spark platform, and I have run
 into a problem that has confused me for a long time. Can someone help me?

 I have a PairRDD<Integer, Integer> named patternRDD, in which the key
 represents a number and the value stores information about the key. I
 want to use two of the VALUEs to calculate a Kendall number, and if the
 number is greater than 0.6, output the two KEYs.

 I have tried to transform the PairRDD into an RDD<Tuple2<Integer,
 Integer>>, add a common key zero to every element, and join the two
 together to get a PairRDD<0, Iterable<Tuple2<Tuple2<key1, value1>,
 Tuple2<key2, value2>>>>. I then tried to use the values() method and map
 the keys out, but it gives me an out of memory error. I think the out of
 memory error is caused by the few entries of my RDD, but I have no idea
 how to solve it.

 Can you help me?

 Regards,
 Gefei Li





Re: How to join two PairRDD together?

2014-08-28 Thread Sean Owen
It sounds like you are adding the same key to every element, and joining,
in order to accomplish a full cartesian join? I can imagine doing it that
way would blow up somewhere. There is a cartesian() method that may do this
more efficiently.
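
To make the pairing step concrete, here is a minimal single-machine sketch in plain Java. It is not Spark code and not the original poster's implementation. The class name PairFilter, the method pairsAbove, and the score parameter are all hypothetical, since the thread never shows how the Kendall number is computed from two values. The sketch visits each unordered pair once and keeps only the keys that pass the threshold, so the full set of pairs is never held in a single collection the way it is when everything is grouped under one shared key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper: illustrates the pair-and-filter logic only.
final class PairFilter {

    /**
     * Returns {key1, key2} for every unordered pair of entries whose
     * score is strictly greater than the threshold.
     * keys[i] and values[i] together form one (key, value) entry.
     */
    static List<int[]> pairsAbove(int[] keys, int[] values,
                                  ToDoubleBiFunction<Integer, Integer> score,
                                  double threshold) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values differ in length");
        }
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            // Start at i + 1 so each pair is scored once and never against itself.
            for (int j = i + 1; j < keys.length; j++) {
                if (score.applyAsDouble(values[i], values[j]) > threshold) {
                    out.add(new int[] {keys[i], keys[j]});
                }
            }
        }
        return out;
    }
}
```

On an RDD, the analogous approach would be to apply the same kind of filter to the output of cartesian() instead of joining on a constant key, though the number of pairs still grows as N^2.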

However if your data set is large, this sort of algorithm for computing
Kendall's tau is going to be very slow since it's N^2 and would create an
unspeakably large shuffle. There are faster algorithms for this statistic.
Also consider sampling your data and computing the join over a small sample
to estimate the statistic.
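
As one illustration of a faster algorithm, Kendall's tau between two equal-length sequences can be computed in O(n log n) by sorting on one sequence and counting inversions in the other with a merge sort. The sketch below is plain Java rather than Spark code. It assumes there are no tied values (the tau-a variant) and that each "value" is a sequence of numbers, which the thread does not confirm. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: Kendall's tau-a via merge-sort inversion counting.
final class KendallTau {

    /** Kendall's tau-a for paired samples x and y, assuming no ties. */
    static double tau(double[] x, double[] y) {
        int n = x.length;
        if (n != y.length || n < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        // Sort the indices by x, then read y in that order.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(x[a], x[b]));
        double[] ys = new double[n];
        for (int i = 0; i < n; i++) {
            ys[i] = y[order[i]];
        }
        // Each inversion in ys is one discordant pair.
        long discordant = countInversions(ys, new double[n], 0, n);
        long totalPairs = (long) n * (n - 1) / 2;
        return 1.0 - 2.0 * discordant / totalPairs;
    }

    /** Sorts a[lo, hi) in place and returns the number of inversions it contained. */
    private static long countInversions(double[] a, double[] buf, int lo, int hi) {
        if (hi - lo < 2) {
            return 0;
        }
        int mid = (lo + hi) >>> 1;
        long inv = countInversions(a, buf, lo, mid) + countInversions(a, buf, mid, hi);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) {
            if (a[i] <= a[j]) {
                buf[k++] = a[i++];
            } else {
                // a[j] is smaller than every remaining element of the left half.
                buf[k++] = a[j++];
                inv += mid - i;
            }
        }
        while (i < mid) {
            buf[k++] = a[i++];
        }
        while (j < hi) {
            buf[k++] = a[j++];
        }
        System.arraycopy(buf, lo, a, lo, hi - lo);
        return inv;
    }
}
```

This speeds up each individual tau computation only. Comparing every entry against every other entry is still quadratic in the number of entries, which is where sampling can help.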


Re: How to join two PairRDD together?

2014-08-25 Thread Vida Ha
Can you paste the code?  It's unclear to me how/when the out of memory is
occurring without seeing the code.



