You should use join:

val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6))))
val rdd2 = sc.parallelize(List((2,(1)), (2,(3)), (3,(9))))
rdd1.join(rdd2).collect
res0: Array[(Int, (Int, Int))] = Array((2,(5,1)), (2,(5,3)), (3,(6,9)))

Please see my cheat sheet, section "3.14 join(otherDataset, [numTasks])":
http://www.openkb.info/2015/01/scala-on-spark-cheatsheet.html

On Wed, Feb 4, 2015 at 3:52 PM, dash <[email protected]> wrote:
> Hey Spark gurus! Sorry for the confusing title. I do not know the exact
> description of my problem; if you do, please tell me so I can change it
> :-)
>
> Say I have two RDDs right now:
>
> val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6))))
> val rdd2 = sc.parallelize(List((2,(1)), (2,(3)), (3,(9))))
>
> I want to combine rdd1 and rdd2 into an rdd3 that looks like
>
> List((1,(3)), (2,(5,1)), (2,(5,3)), (3,(6,9)))
>
> The order within _._2 does not matter, so you can treat it as a Set.
>
> I tried zip, but since there is no guarantee that rdd1 and rdd2 have the
> same length, I do not know if it is doable.
>
> I also looked into PairRDDFunctions; some people union two RDDs and then
> apply a map function. Since I want all combinations grouped by _._1, I do
> not know how to achieve that with union and map.
>
> Thanks in advance!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-combination-like-RDD-based-on-two-RDDs-tp21508.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Thanks,
www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
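One caveat worth noting: join is an inner join, so key 1 (which appears only in rdd1) is dropped from res0, whereas the expected rdd3 in the question keeps (1,(3)). leftOuterJoin keeps every key of the left side and wraps the right-hand value in an Option. Below is a minimal plain-Scala sketch of the two semantics, using Lists instead of RDDs so it runs without a cluster; the flatMap logic is illustrative only, not Spark's actual implementation:

```scala
// Plain-Scala sketch of Spark's join vs. leftOuterJoin semantics.
val rdd1 = List((1, 3), (2, 5), (3, 6))
val rdd2 = List((2, 1), (2, 3), (3, 9))

// Inner join: only keys present on BOTH sides survive -- key 1 is dropped.
val joined =
  for { (k1, v1) <- rdd1; (k2, v2) <- rdd2 if k1 == k2 } yield (k1, (v1, v2))
// List((2,(5,1)), (2,(5,3)), (3,(6,9)))

// Left outer join: every key of the left side survives; misses become None.
val leftJoined = rdd1.flatMap { case (k, v) =>
  val matches = rdd2.collect { case (`k`, w) => (k, (v, Some(w): Option[Int])) }
  if (matches.nonEmpty) matches else List((k, (v, None: Option[Int])))
}
// List((1,(3,None)), (2,(5,Some(1))), (2,(5,Some(3))), (3,(6,Some(9))))
```

With Spark itself, rdd1.leftOuterJoin(rdd2).collect returns the same pairs (the order of elements after the shuffle is not guaranteed), so (1,(3)) is preserved as (1,(3,None)).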
