I've done this with a "broadcast". It worked pretty well. Around 10g
(for the smaller dataset) I started having problems (cf
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E
)

If it's really only 800MB you can probably do this whole thing on a
cellphone so I'm not sure why RDDs are involved.


On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng <[email protected]> wrote:
> Hi,
>
> Just wondering what people suggest for joining of 2 RDDs of very different
> sizes
>
> I have a sequence of map reduce that will in the end yield me a RDD ~ 500MB
> - 800MB  that typically has a couple hundred partitions.
>
> After that I want to join that rdd with 2 smaller rdds 1  will be <50MB
> another probably in the KB range. call them RDDSmall, and RDDTiny.
>
> What is the most efficient way to RDD.join(RDDSmall).join(RDDTiny)?
>
> If RDDSmall has less partitions than RDD, won't the join cause RDD to
> coalesce into the same number of partitions as RDDSmall, and even worse
> RDDTiny?
>
> tks,
> shay
>

Reply via email to