Hi, I'm wondering what people suggest for joining RDDs of very different sizes.
I have a sequence of map and reduce steps that ends up yielding an RDD of roughly 500 MB to 800 MB, typically spread over a couple hundred partitions. I then want to join that RDD with two smaller RDDs: one will be under 50 MB, the other probably in the KB range. Call them RDDSmall and RDDTiny.

What is the most efficient way to do RDD.join(RDDSmall).join(RDDTiny)? If RDDSmall has fewer partitions than RDD, won't the join cause RDD to be coalesced down to the same number of partitions as RDDSmall, or even worse, RDDTiny?

Thanks,
shay
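
P.S. To make the question concrete, here is a minimal plain-Python sketch (not Spark code) of the map-side alternative I have been wondering about: leave the large dataset's partitions as they are and look the two small datasets up as in-memory dicts. The names `map_side_join`, `rdd_partitions`, `small`, and `tiny` are made up for illustration, and the sketch assumes both small datasets have unique keys and fit comfortably in memory.

```python
def map_side_join(partitions, small, tiny):
    """Inner-join every partition of (key, value) pairs against two dicts.

    No data moves between partitions, so the partition count of the
    large dataset is preserved.
    """
    joined = []
    for part in partitions:
        out = []
        for key, value in part:
            # Inner-join semantics: keep the record only if both
            # small datasets contain the key.
            if key in small and key in tiny:
                # Same nesting as RDD.join(RDDSmall).join(RDDTiny):
                # (key, ((value, small_value), tiny_value))
                out.append((key, ((value, small[key]), tiny[key])))
        joined.append(out)
    return joined


# Hypothetical stand-ins: three partitions of the large dataset,
# plus the two small datasets as plain dicts.
rdd_partitions = [[("a", 1), ("b", 2)], [("c", 3)], [("a", 4)]]
small = {"a": "s1", "c": "s3"}
tiny = {"a": "t1", "b": "t2", "c": "t3"}

result = map_side_join(rdd_partitions, small, tiny)
```

In Spark itself I assume this would mean collecting RDDSmall and RDDTiny to the driver, broadcasting them, and doing the lookups inside a mapPartitions call on the large RDD, but I have not verified whether that beats two regular joins.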
