If the biggest RDD (RDD 2, with about 4,500,000 entries) fits in memory on
your driver node and on each worker, you can collect it, broadcast it to the
workers, and use it in a map function. Assuming at most 4,500,000 entries,
each a pair of ints (4 bytes each), that is well under 1 GB (around 0.03 GB,
if I am not mistaken). So you can probably try this solution. If you are
worried about network traffic, a groupBy would incur shuffle traffic anyway.
The main drawback here is that your driver will have extra work to do.
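
A minimal sketch of this broadcast approach, on plain Python collections with the sample data from the quoted message below. In Spark, `kv` would come from `rdd2.collectAsMap()`, be wrapped in `sc.broadcast(kv)`, and the loop would be an `rdd1.map(...)`:

```python
# Hypothetical sample data in the shape described in the thread.
rdd1_lines = [
    "1 2 3",
    "1 3 5",
]
rdd2_pairs = [("1", 11), ("2", 22), ("3", 33), ("5", 55)]

# "Collect" the small RDD into a lookup table (the broadcast value).
kv = dict(rdd2_pairs)

# Map each line to the sum of the values of its keys.
# Keys whose value is zero are simply absent, so .get(k, 0) handles them.
sums = [sum(kv.get(k, 0) for k in line.split()) for line in rdd1_lines]
print(sums)  # [66, 99]
```

No shuffle is needed: each worker holds the whole lookup table, and each line is processed independently.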


2014-02-18 11:52 GMT+01:00 Guillaume Pitel <[email protected]>:

>  Here's what I would do :
>
> RDD1 :
> "1" "2" "3"
> "1" "3" "5"
>
> RDD2 :
> ("1", 11)
> ("2", 22)
> ("3", 33)
> ("5", 55)
>
> 1 / flatMap your lines from RDD1 to RDD1bis as (key, lineId) pairs (you'll
> have to use mapPartitionsWithIndex for that in order to produce a lineId)
> So you go from this :
>
> "1" "2" "3"
> "1" "3" "5"
>
> to :
>
> ("1","L1")
> ("2","L1")
> ("3","L1")
> ("1","L2")
> ("3","L2")
> ("5","L2")
>
> 2 / join RDD1 and RDD2 => RDD1+2
>
> ("1",("L1",11))
> ("2",("L1",22))
> ("3",("L1",33))
> ("1",("L2",11))
> ("3",("L2",33))
> ("5",("L2",55))
>
> 3/ map RDD1+2 to its values (_2), i.e. the (lineId, value) pairs
> 4/ groupBy _1 (the lineIds) and sum each group; the grouping gives
> ("L1",[11,22,33])
> ("L2",[11,33,55])
> which sum to ("L1",66) and ("L2",99).
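
The four quoted steps can be sketched on plain Python collections with the sample RDDs above; in Spark each step would be the corresponding RDD operation (mapPartitionsWithIndex/flatMap, join, map, groupByKey):

```python
from collections import defaultdict

rdd1 = ["1 2 3", "1 3 5"]
rdd2 = [("1", 11), ("2", 22), ("3", 33), ("5", 55)]

# 1/ flatMap each line to (key, lineId) pairs, tagging lines with an id.
rdd1bis = [(k, f"L{i + 1}") for i, line in enumerate(rdd1)
           for k in line.split()]

# 2/ join on the key, producing (key, (lineId, value)).
values = dict(rdd2)
joined = [(k, (line_id, values[k])) for k, line_id in rdd1bis if k in values]

# 3/ keep the (lineId, value) part; 4/ group by lineId and sum.
totals = defaultdict(int)
for _key, (line_id, v) in joined:
    totals[line_id] += v

print(dict(totals))  # {'L1': 66, 'L2': 99}
```

Unlike the broadcast approach, a real Spark join here shuffles both sides by key, which matters at the 10 GB scale described below.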
>
> Guillaume
>
>  Sincere thanks for your work on Spark. It simplifies parallel and
> iterative programming, and it means a lot to us.
>
>
> Our Spark program faces a small problem, and we would appreciate your help
> in finding an efficient way to solve it.
>
>
>
> Environments:
>
> We run spark in standalone mode. We have two RDDs:
>
> RDD detail:
>
> Type one RDD: generated from sc.textFile('file'); each line in 'file' is a
> list of keys, like the following lines:
>
>
> 1 149 255 2238 4480 5951 7276 7368 14670 12661 13060 13450 14674
>
> 1 149 255 2238 4480 5951 7276 7368 7678 12672 13078 13450 14674
>
> 1 149 257 2239 4485 5952 7276 7368 7678 12683 13096 13450 14674
>
> 1 149 259 2241 4487 5954 7276 7368 7678 12683 13096 14673 14674
>
> 1 149 260 2242 4488 5955 7276 7368 14670 14671 14672 14673 14674
>
> 1 151 258 2240 4486 5953 7276 7368 14670 12684 13096 13450 14674
>
> 1 151 258 2240 4486 5953 7276 7368 14670 14671 14672 13450 14674
>
> 1 151 259 2241 4487 5954 7276 7368 7678 12683 13096 13450 14674
>
> 1 153 250 2237 4472 5950 7276 7368 14670 14671 13078 14673 14674
>
> 1 153 258 2240 4486 5953 7276 7368 7678 12683 13096 14673 14674
>
> ...
>
> Type two RDD: a set of (key, value) pairs.
>
>
>
> The problem we want to solve:
>
> For each line in the type one RDD, we need to look up the value for each of
> the line's keys in the type two RDD, and then compute the sum of those
> values.
>
>
>
> Other Details:
>
> Each line of the type one RDD contains about 50 keys, and the file backing
> this RDD is about 10 GB.
>
> The largest key in the type two RDD is about 4,500,000, and we do not store
> a (key, value) pair when the value is zero.
>
> Also, the keys in the type one RDD are skewed: a key like 1 appears very
> often, while a key like 15877 appears only a few times.
>
>
>
> We want to find a fast way to solve this problem.
>
> Sincerely thanks
>
>
>     Bo Han
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-join-this-two-complicated-rdds-tp1665.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>

