The starting data sets are much larger than that; I start from a couple of
~20GB data sets.

Any hints on when it becomes impractical to broadcast, say beyond ~50MB? Any
ballpark figure?
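For reference, the map-side ("broadcast") join suggested in the reply below can be sketched roughly as follows. Everything here is illustrative (the names `rddLarge`/`rddSmall` are assumptions, not from the thread); in Spark the small side would be shipped with `sc.broadcast` and the lookup done inside `mapPartitions` over the large RDD:

```scala
// Sketch of a map-side ("broadcast") join. Assumes the smaller side fits
// comfortably in memory on every worker. In Spark this would look like:
//
//   val small  = sc.broadcast(rddSmall.collectAsMap())
//   val joined = rddLarge.mapPartitions { iter =>
//     iter.flatMap { case (k, v) => small.value.get(k).map(s => (k, (v, s))) }
//   }
//
// The pure-Scala core of that flatMap, shown as a standalone function:
def broadcastJoin[K, V, S](
    large: Iterator[(K, V)],
    small: Map[K, S]): Iterator[(K, (V, S))] =
  // Inner join: keep only large-side records whose key exists in the map.
  large.flatMap { case (k, v) => small.get(k).map(s => (k, (v, s))) }
```

Because the small side never shuffles, the large RDD keeps its partitioning; the usual caveat is that every executor holds a full copy of the broadcast map.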


On Thu, Nov 14, 2013 at 11:44 AM, Ryan Compton <[email protected]> wrote:

> I've done this with a "broadcast". It worked pretty well. Around 10GB
> (for the smaller dataset) I started having problems (cf.
>
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E
> )
>
> If it's really only 800MB, you can probably do this whole thing on a
> cellphone, so I'm not sure why RDDs are involved.
>
>
> On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng <[email protected]> wrote:
> > Hi,
> >
> > Just wondering what people suggest for joining two RDDs of very
> different
> > sizes.
> >
> > I have a sequence of map-reduce steps that in the end yields an RDD of
> ~500MB
> > - 800MB, typically with a couple hundred partitions.
> >
> > After that I want to join that RDD with two smaller RDDs: one will be <50MB,
> > the other probably in the KB range. Call them RDDSmall and RDDTiny.
> >
> > What is the most efficient way to RDD.join(RDDSmall).join(RDDTiny)?
> >
> > If RDDSmall has fewer partitions than RDD, won't the join cause RDD to
> > coalesce into the same number of partitions as RDDSmall, or, even worse,
> > RDDTiny?
> >
> > tks,
> > shay
> >
>
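On the partition-count question in the quoted message: as I understand the Spark source of that era, `join` without an explicit argument picks a default partitioner that prefers an existing partitioner or the larger input's partition count, and both sides are shuffled into it; the large RDD is not coalesced down to the small one's width. There is also an explicit overload, `rddLarge.join(rddSmall, numPartitions)`, to pin the output width yourself. The routing is a plain hash-mod, which this stand-in (not the actual Spark class) mirrors:

```scala
// Stand-in for the hash routing used by Spark's HashPartitioner: every key
// on either side of the join is sent, via a non-negative mod, into one of
// the chosen number of output partitions, regardless of how many
// partitions the inputs had.
def getPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  // JVM % can be negative for negative hash codes; shift into [0, n).
  rawMod + (if (rawMod < 0) numPartitions else 0)
}
```

So joining against RDDTiny need not collapse the big RDD to a handful of partitions; passing `rdd.partitions.size` (or a `Partitioner`) to `join` makes the choice explicit.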
