Thank you sincerely for your work on Spark. It greatly simplifies writing
parallel and iterative programs, and it means a lot to us.


Our Spark program has run into a problem, and we are seeking your help to find
an efficient way to solve it.

 

Environment:

We run Spark in standalone mode. We have two RDDs:

RDD details:

Type one RDD: generated from sc.textFile('file'). Each line in 'file' is a
list of keys, like the following lines:


1 149 255 2238 4480 5951 7276 7368 14670 12661 13060 13450 14674

1 149 255 2238 4480 5951 7276 7368 7678 12672 13078 13450 14674

1 149 257 2239 4485 5952 7276 7368 7678 12683 13096 13450 14674

1 149 259 2241 4487 5954 7276 7368 7678 12683 13096 14673 14674

1 149 260 2242 4488 5955 7276 7368 14670 14671 14672 14673 14674

1 151 258 2240 4486 5953 7276 7368 14670 12684 13096 13450 14674

1 151 258 2240 4486 5953 7276 7368 14670 14671 14672 13450 14674

1 151 259 2241 4487 5954 7276 7368 7678 12683 13096 13450 14674

1 153 250 2237 4472 5950 7276 7368 14670 14671 13078 14673 14674

1 153 258 2240 4486 5953 7276 7368 7678 12683 13096 14673 14674

...

Type two RDD: a set of (key, value) pairs.

 

The problem we want to solve:

For each line in RDD one, we need to look up, for every key on that line, the
corresponding value in the type two RDD, and then compute the sum of those
values.
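
To make the question concrete, here is a rough sketch of the straightforward
join-based approach we have in mind (Scala). The path 'kvfile' and the way the
(key, value) pairs are parsed are just placeholders for illustration, and the
sketch assumes a Spark version where RDD.zipWithIndex is available:

    import org.apache.spark.SparkContext._   // pair-RDD operations: join, reduceByKey

    // Type one RDD: each line is a whitespace-separated list of keys.
    val lines = sc.textFile("file")

    // Type two RDD: (key, value) pairs; here parsed from a hypothetical text file.
    val kv = sc.textFile("kvfile").map { l =>
      val Array(k, v) = l.trim.split("\\s+")
      (k.toLong, v.toLong)
    }

    // Tag every line with a stable id, then explode it into (key, lineId) pairs.
    val keyToLine = lines.zipWithIndex().flatMap { case (line, id) =>
      line.trim.split("\\s+").map(k => (k.toLong, id))
    }

    // Join each key against its value, then sum the values back per line.
    // Keys missing from kv (i.e. value zero) simply drop out of the join,
    // which matches not storing zero-valued pairs.
    val sums = keyToLine.join(kv)
      .map { case (_, (id, value)) => (id, value) }
      .reduceByKey(_ + _)   // (lineId, sum of that line's values)

Our concern with this version is the shuffle: all of the exploded (key, lineId)
pairs from the 10 GB input get shuffled, and a hot key such as 1 piles up in a
few partitions. (Also, a line none of whose keys appear in kv is absent from
sums.)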

 

Other Details:

Each line of the type one RDD contains about 50 keys. The file backing RDD one
is about 10 GB.

The largest key in RDD two is about 4,500,000, and we do not store a
(key, value) pair when the value is zero.

Also, the key frequencies in the type one RDD are very skewed: for example,
the key 1 appears on many lines, while a key such as 15877 appears on only a
few.
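
Given those sizes, one idea we are considering is to avoid the shuffle
entirely: since RDD two has at most about 4.5 million entries (fewer, since
zero values are dropped), its contents might fit in memory as a map that we
broadcast and probe in a single pass over RDD one. A minimal sketch, reusing
the lines and kv RDDs from the earlier snippet:

    // Collect the (key, value) pairs to the driver and ship one read-only
    // copy to each executor. Assumes the map fits comfortably in memory.
    val valueByKey = sc.broadcast(kv.collectAsMap())

    // One map-side pass over RDD one: no shuffle, so key skew is harmless.
    val sumsPerLine = lines.map { line =>
      val m = valueByKey.value
      line.trim.split("\\s+").map(k => m.getOrElse(k.toLong, 0L)).sum
    }

Does this broadcast approach make sense at this scale, or is there a better
pattern (for example, an array indexed by key instead of a hash map, given
that keys are bounded by about 4.5 million)?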

 

We would like to find a fast way to solve this problem.

Sincere thanks,


    Bo Han



