I've done this with a "broadcast". It worked pretty well. Around 10g (for the smaller dataset) I started having problems (cf http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E )
If it's really only 800MB you can probably do this whole thing on a cellphone so I'm not sure why RDDs are involved. On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng <[email protected]> wrote: > Hi, > > Just wondering what people suggest for joining of 2 RDDs of very different > sizes > > I have a sequence of map reduce that will in the end yield me a RDD ~ 500MB > - 800MB that typically has a couple hundred partitions. > > After that I want to join that rdd with 2 smaller rdds 1 will be <50MB > another probably in the KB range. call them RDDSmall, and RDDTiny. > > What is the most efficient way to RDD.join(RDDSmall).join(RDDTiny)? > > If RDDSmall has less partitions than RDD, won't the join cause RDD to > coalesce into the same number of partitions as RDDSmall, and even worse > RDDTiny? > > tks, > shay >
