It sounds like you are adding the same key to every element and then joining,
in order to accomplish a full cartesian join? I can imagine doing it that
way would blow up somewhere. There is a cartesian() method that does this,
perhaps more efficiently.
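
For example, here is a minimal Scala sketch of that idea (patternRDD and its
Array[Double] values are hypothetical stand-ins for your own data, sc is the
shell's SparkContext, and kendallTau is a naive illustration that ignores ties):

    // Hypothetical stand-in for the real patternRDD: key -> vector of observations.
    val patternRDD = sc.parallelize(Seq(
      (1, Array(1.0, 2.0, 3.0)),
      (2, Array(3.0, 2.0, 1.0)),
      (3, Array(1.0, 3.0, 2.0))))

    // Naive O(n^2) Kendall's tau between two equal-length vectors (ignores ties).
    val kendallTau = (a: Array[Double], b: Array[Double]) => {
      val n = a.length
      val concordance = for (i <- 0 until n; j <- i + 1 until n)
        yield math.signum(a(i) - a(j)) * math.signum(b(i) - b(j))
      concordance.sum / (n * (n - 1) / 2.0)
    }

    // All-pairs comparison via cartesian(); no artificial common key or join needed.
    val correlatedKeys = patternRDD.cartesian(patternRDD)
      .filter { case ((k1, _), (k2, _)) => k1 < k2 }   // keep each unordered pair once
      .map { case ((k1, v1), (k2, v2)) => ((k1, k2), kendallTau(v1, v2)) }
      .filter { case (_, tau) => tau > 0.6 }           // keep strongly correlated pairs
      .map { case (keys, _) => keys }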

However, if your data set is large, this sort of algorithm for computing
Kendall's tau is going to be very slow, since it's O(N^2) and would create an
unspeakably large shuffle. There are faster algorithms for this statistic.
Also consider sampling your data and computing the join over a small sample
to estimate the statistic.
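
As a rough sketch of the sampling idea (reusing the hypothetical patternRDD and
kendallTau from the sketch above), you can estimate the statistic on a small
random fraction of the elements before deciding whether the full join is worth
running:

    // Compare only ~1% of the elements to get a cheap estimate of the statistic.
    // sample(withReplacement, fraction, seed) keeps roughly fraction * N elements.
    val sampled = patternRDD.sample(false, 0.01, 42L)
    val estimatedTaus = sampled.cartesian(sampled)
      .filter { case ((k1, _), (k2, _)) => k1 < k2 }
      .map { case ((k1, v1), (k2, v2)) => kendallTau(v1, v2) }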


On Thu, Aug 28, 2014 at 11:15 AM, Yanbo Liang <yanboha...@gmail.com> wrote:

> Maybe you can refer to the sliding method on RDDs, but right now it's a
> private method in MLlib.
> Look at org.apache.spark.mllib.rdd.RDDFunctions.
>
>
> 2014-08-26 12:59 GMT+08:00 Vida Ha <v...@databricks.com>:
>
>> Can you paste the code? It's unclear to me how/when the out-of-memory error
>> is occurring without seeing the code.
>>
>>
>>
>>
>> On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li <gefeili.2...@gmail.com>
>> wrote:
>>
>>> Hello everyone,
>>>     I am porting a clustering algorithm to the Spark platform, and I have
>>> run into a problem that has confused me for a long time. Can someone help me?
>>>
>>>     I have a PairRDD<Integer, Integer> named patternRDD, in which the key
>>> represents a number and the value stores information about that key. I want
>>> to use pairs of VALUEs to calculate a Kendall's tau number, and if the
>>> number is greater than 0.6, output the two corresponding KEYs.
>>>
>>>     I have tried transforming the PairRDD into an RDD<Tuple2<Integer,
>>> Integer>>, adding a common key zero to every element, and joining it with
>>> itself to get a PairRDD<0, Iterable<Tuple2<Tuple2<key1, value1>,
>>> Tuple2<key2, value2>>>>, then using the values() method and mapping the
>>> keys out, but it gives me an "out of memory" error. I think the error is
>>> caused by my RDD having so few distinct keys, but I have no idea how to solve it.
>>>
>>>      Can you help me?
>>>
>>> Regards,
>>> Gefei Li
>>>
>>
>>
>
