Hi, I have an RDD that holds my application data, and it is huge. I want to join it with reference data that is also too big to fit in memory, so I do not want to use a broadcast variable.
What other options do I have to perform such a join? I am using Cassandra as my data store, so should I just query Cassandra directly to fetch the reference data I need? Also, when I join two RDDs, will Spark do a full scan of both RDDs, or will it hash-partition the two RDDs so that records with the same keys end up on the same node? Thanks, Ankur
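P.S. To make the second question concrete, here is my rough mental model of what a shuffle (hash-partition) join would do: both sides get hashed on the key so matching keys land in the same partition, and each partition is then joined locally. This is a plain-Python sketch of that idea, not Spark code, and all the names are mine:

```python
from collections import defaultdict

def hash_partition(pairs, num_partitions):
    """Bucket (key, value) pairs by hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def partition_join(left, right):
    """Local inner join of two co-located partitions."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in table[k]]

def shuffle_join(app_data, ref_data, num_partitions=4):
    """Hash-partition both sides, then join partition-by-partition.

    Because both sides use the same hash and partition count,
    a given key can only match within its own partition pair.
    """
    left_parts = hash_partition(app_data, num_partitions)
    right_parts = hash_partition(ref_data, num_partitions)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        result.extend(partition_join(lp, rp))
    return result
```

Is this roughly what Spark does when I call `join` on two pair RDDs, and can I avoid the shuffle by pre-partitioning both RDDs the same way?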