The starting data set is much larger than that; I start from a couple of ~20GB data sets.
Any hints on when it becomes impractical to broadcast? ~>50MB? Some ballpark?

On Thu, Nov 14, 2013 at 11:44 AM, Ryan Compton <[email protected]> wrote:
> I've done this with a "broadcast". It worked pretty well. Around 10g
> (for the smaller dataset) I started having problems (cf
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E
> )
>
> If it's really only 800MB you can probably do this whole thing on a
> cellphone, so I'm not sure why RDDs are involved.
>
> On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng <[email protected]> wrote:
> > Hi,
> >
> > Just wondering what people suggest for joining two RDDs of very
> > different sizes.
> >
> > I have a sequence of map-reduce steps that in the end yields an RDD of
> > ~500MB-800MB that typically has a couple hundred partitions.
> >
> > After that I want to join that RDD with two smaller RDDs: one will be
> > <50MB, the other probably in the KB range. Call them RDDSmall and
> > RDDTiny.
> >
> > What is the most efficient way to do RDD.join(RDDSmall).join(RDDTiny)?
> >
> > If RDDSmall has fewer partitions than RDD, won't the join cause RDD to
> > coalesce into the same number of partitions as RDDSmall, and even
> > worse, RDDTiny?
> >
> > tks,
> > shay
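For reference, the "broadcast" approach Ryan describes is a map-side hash join: ship the small dataset to every worker and do the join inside a map over the large RDD, so the large side is never shuffled. Here is a minimal sketch of the pattern in plain Python, with dicts standing in for Spark's broadcast variables; the function and variable names are illustrative, not from the thread:

```python
def broadcast_join(large, small_pairs):
    """Inner-join a large sequence of (key, value) pairs against a small
    one without shuffling the large side: build a hash map of the small
    side once, then look each large-side key up in it."""
    small = dict(small_pairs)  # in Spark, roughly: sc.broadcast(dict(...))
    # This comprehension plays the role of a map() over the large RDD's
    # partitions; only keys present on the small side survive.
    return [(k, (v, small[k])) for k, v in large if k in small]

large = [(1, "a"), (2, "b"), (3, "c")]
small = [(1, "x"), (3, "y")]
print(broadcast_join(large, small))  # [(1, ('a', 'x')), (3, ('c', 'y'))]
```

In Spark itself the same idea would look something like `val b = sc.broadcast(smallMap)` followed by a `map` over the large RDD that reads `b.value`; the pattern only pays off while the broadcast side comfortably fits in each worker's memory, which is why Ryan's ~10GB threshold shows up as the practical limit.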
