My broadcast joins were working fine when the small data set was
several GB, though that probably has more to do with the computer I
was using than anything else. Switching code between a regular join
and a broadcast join is easy. I basically copied the example here:
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
and it worked.
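For anyone following along, the pattern in that deck is essentially a map-side join: collect the small dataset, broadcast it to every worker, and map over the large RDD with a lookup. Here is a minimal sketch of that pattern in plain Python (a dict stands in for Spark's broadcast variable and a list for the large RDD; `broadcast_join` is a made-up helper name, not a Spark API):

```python
# Sketch of a map-side ("broadcast") join: the small table is shipped
# whole to every worker, so the large dataset is never shuffled.
# In Spark this is roughly sc.broadcast(small.collectAsMap()) followed
# by a map over the large RDD; plain Python stands in for the cluster.

def broadcast_join(large, small):
    # "Broadcast" the small side: materialize it as one lookup table.
    lookup = dict(small)
    # Map over the large side; no shuffle of `large` is needed.
    return [(k, (v, lookup[k])) for k, v in large if k in lookup]

large = [(1, "a"), (2, "b"), (3, "c")]
small = [(1, "x"), (3, "y")]
print(broadcast_join(large, small))  # [(1, ('a', 'x')), (3, ('c', 'y'))]
```

The win is that only the small side moves over the network, once per worker, instead of both sides being shuffled by key.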

On Thu, Nov 14, 2013 at 12:07 PM, Shay Seng <[email protected]> wrote:
> The starting data set is much larger than that, I start from a couple ~20GB
> data sets.
>
> Any hints on when it becomes impractical to broadcast ... ~ >50MB? Some
> ballpark figure?
>
>
> On Thu, Nov 14, 2013 at 11:44 AM, Ryan Compton <[email protected]>
> wrote:
>>
>> I've done this with a "broadcast". It worked pretty well. Around 10GB
>> (for the smaller dataset) I started having problems (cf.
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E
>> )
>>
>> If it's really only 800MB, you can probably do this whole thing on a
>> cellphone, so I'm not sure why RDDs are involved.
>>
>>
>> On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng <[email protected]> wrote:
>> > Hi,
>> >
>> > Just wondering what people suggest for joining 2 RDDs of very
>> > different sizes.
>> >
>> > I have a sequence of map-reduce steps that will in the end yield an
>> > RDD of ~500MB-800MB that typically has a couple hundred partitions.
>> >
>> > After that I want to join that RDD with 2 smaller RDDs: one will be <50MB,
>> > the other probably in the KB range. Call them RDDSmall and RDDTiny.
>> >
>> > What is the most efficient way to RDD.join(RDDSmall).join(RDDTiny)?
>> >
>> > If RDDSmall has fewer partitions than RDD, won't the join cause RDD to
>> > coalesce into the same number of partitions as RDDSmall, or, even worse,
>> > RDDTiny?
>> >
>> > tks,
>> > shay
>> >
>
>
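On the partition-count question in the original message: my understanding (an assumption worth checking against your Spark version's `Partitioner.defaultPartitioner`) is that a shuffle join repartitions to the side with the *most* partitions, not the fewest, so joining against RDDSmall should not coalesce the large RDD. A plain-Python sketch of that shuffle-join behavior (`hash_partition` and `shuffle_join` are made-up helper names, not Spark APIs):

```python
# Sketch of how a shuffle join picks its output partition count,
# assuming Spark's default-partitioner behavior of taking the LARGER
# of the two inputs' partition counts. Plain Python stands in for the
# cluster; hash partitioning by key mirrors Spark's HashPartitioner.

def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

def shuffle_join(a, a_parts, b, b_parts):
    # Both sides get re-partitioned to the larger partition count.
    n = max(a_parts, b_parts)
    out = []
    for pa, pb in zip(hash_partition(a, n), hash_partition(b, n)):
        lookup = dict(pb)
        out.append([(k, (v, lookup[k])) for k, v in pa if k in lookup])
    return out  # one inner list per output partition

big = [(i, "big%d" % i) for i in range(6)]
small = [(0, "s0"), (4, "s4")]
joined = shuffle_join(big, 4, small, 1)
print(len(joined))  # 4 -- the larger of the two inputs' partition counts
```

Note that even though this keeps the large side's partition count, both sides still get shuffled by key, which is exactly the cost a broadcast join avoids for a small-enough right-hand side.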
