Forgot to add an important point. My sequence files are sorted (they're
actually Hadoop map files). Since they're sorted, it makes sense to do a
fetch at the partition-level of the inner sequence file.

Thanks,
Ameet


On Fri, Oct 18, 2013 at 5:20 PM, Ameet Kini <[email protected]> wrote:

>
> I've seen discussions where the suggestion is to do a map-side join, but
> haven't seen an example yet, and can certainly use one. I have two sequence
> files where the key is unique within each file, so the join is a one-to-one
> join, and can hence benefit from a map-side join. However both sequence
> files can be large, so reading one of them completely in the driver and
> broadcasting it out would be expensive.
>
> I don't think there is a map-side join implementation in Spark but earlier
> suggestions have been to write one using mapPartitions on one of the
> operands as the outer loop. If that is the case, how would I fetch the
> split corresponding to the keys in the outer's partition. I'd prefer to do
> a fetch-per-partition rather than a fetch-per-tuple.
>
> In any case, some feedback, and preferably, an example of a map-side join
> without broadcasting would help.
>
> Thanks,
> Ameet
>

Reply via email to