Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Z Z
I have two RDDs, one very large and the other much smaller. I'd like to find all unique tuples in the large RDD whose keys appear in the small RDD. The large RDD also contains duplicate tuples, and I only care about the distinct ones. For example large_rdd = sc.parallelize([('abcdefghij'[i%10], i) for i in

Re: Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Z Z
c.broadcast(small_rdd)
>
> then large_rdd.filter(x.key in bc.value).map( x => {
>   bc.value.other_fields + x
> }).distinct.groupByKey
>
> On Dec 7, 2015, at 1:41 PM, Z Z <zonked.zo...@gmail.com> wrote:
>
> > I have two RDDs, one really
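
For readers following the thread, the filter-then-distinct idea suggested in the reply can be sketched in plain Python, without a Spark cluster. This is only an illustration of the logic, not the poster's actual code: the sample data, the names `large`, `small`, and `tuples_with_small_keys` are all hypothetical, and the original example data is truncated in the archive, so the ranges here are assumptions. In Spark itself, the key set would presumably be collected from the small RDD and shipped with `sc.broadcast`, and the filter and de-duplication would be done with `RDD.filter` and `RDD.distinct`.

```python
# Plain-Python stand-ins for the two RDDs. The data is hypothetical;
# the example in the original post is cut off in the archive.
large = [("abcdefghij"[i % 10], i % 20) for i in range(100)]  # contains duplicates
small = [("a", 1), ("c", 3)]


def tuples_with_small_keys(large_pairs, small_pairs):
    """Return the distinct (key, value) pairs of large_pairs whose key occurs in small_pairs."""
    # Only the keys of the small side are needed for the membership test.
    # In Spark this set is what would be broadcast to the executors.
    keys = {k for k, _ in small_pairs}
    # The set comprehension filters by key and removes duplicates in one pass,
    # playing the role of filter(...) followed by distinct().
    return {pair for pair in large_pairs if pair[0] in keys}


result = tuples_with_small_keys(large, small)
print(sorted(result))
```

The point of filtering before any join is that the large side shrinks to the matching keys first, so a later join or groupByKey moves far less data. Whether that beats a plain join depends on how small the small RDD really is, since the key set has to fit in memory on every executor.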