I have two RDDs, one very large and the other much smaller. I'd like to
find all unique tuples in the large RDD whose keys appear in the small RDD.
The large RDD contains duplicate tuples, and I only care about the distinct ones.
For example:
large_rdd = sc.parallelize([('abcdefghij'[i%10], i) for i in
c.broadcast(small_rdd)
>
>
> then large_rdd.filter(x.key in bc.value).map( x => {
> bc.value.other_fields + x
> }).distinct.groupByKey
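To make the filter-then-distinct idea above concrete, here is a minimal plain-Python sketch in which ordinary lists stand in for the two RDDs. The sample data and the name `distinct_matching` are hypothetical and only for illustration. The comments note what I believe are the rough PySpark equivalents (`sc.broadcast`, `filter`, `distinct`); treat it as a sketch of the logic, not a tuned Spark job.

```python
# Plain-Python stand-ins for the two RDDs (hypothetical sample data).
large = [("abcdefghij"[i % 10], i % 20) for i in range(100)]
small = [("a", 1), ("c", 3)]


def distinct_matching(large_pairs, small_pairs):
    """Return the distinct (key, value) pairs whose key occurs in small_pairs."""
    # Collect the small side's keys into a set for O(1) membership tests.
    # Rough PySpark equivalent (assumes the key set fits in driver memory):
    #   bc = sc.broadcast(set(small_rdd.keys().collect()))
    keys = {k for k, _ in small_pairs}
    # Keep only pairs with a matching key, then de-duplicate via a set.
    # Rough PySpark equivalent:
    #   large_rdd.filter(lambda kv: kv[0] in bc.value).distinct()
    return sorted({kv for kv in large_pairs if kv[0] in keys})


result = distinct_matching(large, small)
print(result)
```

Filtering before de-duplicating means only the matching tuples take part in the distinct step, which in Spark is the step that shuffles data.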
>
> On Dec 7, 2015, at 1:41 PM, Z Z <zonked.zo...@gmail.com> wrote:
>
> I have two RDDs, one really