Re: Broadcast joins on RDD

2015-01-12 Thread Reza Zadeh
First, you should collect().toMap() the small RDD, then use broadcast followed by a map over the large RDD to do a map-side join (slide 10 has an example).
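A minimal sketch of that pattern, assuming largeRdd and smallRdd are placeholder pair RDDs (RDD[(K, V)] and RDD[(K, W)]) and sc is your SparkContext:

// Collect the small RDD to the driver and broadcast it as a Map.
val smallMap = smallRdd.collect().toMap
val smallBc = sc.broadcast(smallMap)

// Map-side inner join: look each key up in the broadcast map on the executors;
// keys that are missing from the small side are dropped.
val joined = largeRdd.flatMap { case (k, v) =>
  smallBc.value.get(k).map(w => (k, (v, w)))
}

No shuffle of largeRdd is needed, since the lookup happens locally in each task.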

Spark SQL also does this by default for tables that are smaller than the
spark.sql.autoBroadcastJoinThreshold setting (10 KB by default, which is
really small, but you can bump it up by setting the threshold to a larger
value, in bytes).
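If you are going through Spark SQL, a hedged example of raising the threshold (the value is in bytes; 104857600, i.e. roughly 100 MB, is just an illustration):

// Illustrative only: raise the auto-broadcast threshold to ~100 MB (value in bytes).
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")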

On Mon, Jan 12, 2015 at 3:15 PM, Pala M Muthaia wrote:

> Hi,
>
>
> How do I do a broadcast/map join on RDDs? I have a large RDD that I want to
> inner join with a small RDD. Instead of having the large RDD repartitioned
> and shuffled for the join, I would rather send a copy of the small RDD to each
> task and then perform the join locally.
>
> How would I specify this in Spark code? I didn't find much documentation
> online. I attempted to create a broadcast variable out of the small RDD and
> then access that in the join operator:
>
> largeRdd.join(smallRddBroadCastVar.value)
>
> but that didn't work as expected (I found that all rows with the same key
> were on the same task).
>
> I am using Spark version 1.0.1
>
>
> Thanks,
> pala