My broadcast joins were working fine when the small data set was several GB, though that probably has more to do with the computer I was using than anything else. Switching code between a regular join and a broadcast join is easy. I basically copied the example here: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf and it worked.
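For what it's worth, the idea behind the broadcast join in that deck is a map-side join: ship the small dataset to every worker and do the lookup inside a map over the large one, so the large dataset never gets shuffled. Here's a minimal plain-Python sketch of the concept (this is not the Spark API; `broadcast_join` and the sample pairs are made up for illustration):

```python
# Conceptual sketch of a map-side ("broadcast") join, assuming the small
# dataset fits comfortably in memory. In Spark you'd broadcast the small
# table and join inside a map over the large RDD's partitions.

def broadcast_join(large, small):
    """Join (key, value) pairs from `large` against an in-memory copy of `small`.

    No shuffle of `large` is needed: each partition just looks keys up
    in its local copy of the small table.
    """
    small_map = dict(small)          # the "broadcast" copy, replicated to every worker
    return [(k, (v, small_map[k]))   # emit joined (key, (left, right)) pairs
            for k, v in large
            if k in small_map]       # inner-join semantics: drop unmatched keys

large_rdd = [("a", 1), ("b", 2), ("c", 3)]
small_rdd = [("a", 10), ("c", 30)]
print(broadcast_join(large_rdd, small_rdd))
# [('a', (1, 10)), ('c', (3, 30))]
```

The win is that the cost scales with the size of the large dataset plus one replicated copy of the small one per worker, which is why it stops being practical once the "small" side no longer fits in each worker's memory.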
On Thu, Nov 14, 2013 at 12:07 PM, Shay Seng <[email protected]> wrote:
> The starting data set is much larger than that; I start from a couple of
> ~20GB data sets.
>
> Any hints on when it becomes impractical to broadcast (~>50MB)? Some
> ballpark?
>
>
> On Thu, Nov 14, 2013 at 11:44 AM, Ryan Compton <[email protected]>
> wrote:
>>
>> I've done this with a "broadcast". It worked pretty well. Around 10GB
>> (for the smaller dataset) I started having problems (cf.
>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E
>> )
>>
>> If it's really only 800MB you can probably do this whole thing on a
>> cellphone, so I'm not sure why RDDs are involved.
>>
>>
>> On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng <[email protected]> wrote:
>> > Hi,
>> >
>> > Just wondering what people suggest for joining two RDDs of very
>> > different sizes.
>> >
>> > I have a sequence of map-reduce steps that will in the end yield an
>> > RDD of ~500MB-800MB that typically has a couple hundred partitions.
>> >
>> > After that I want to join that RDD with two smaller RDDs: one will be
>> > <50MB, the other probably in the KB range. Call them RDDSmall and
>> > RDDTiny.
>> >
>> > What is the most efficient way to do RDD.join(RDDSmall).join(RDDTiny)?
>> >
>> > If RDDSmall has fewer partitions than RDD, won't the join cause RDD to
>> > coalesce into the same number of partitions as RDDSmall, and even
>> > worse, RDDTiny?
>> >
>> > tks,
>> > shay
