Hi,

I have an RDD that holds my application data, and it is huge. I want to join it
with reference data that is also too large to fit in memory, so I do not
want to use a broadcast variable.

What other options do I have to perform such joins?

I am using Cassandra as my data store, so should I just query Cassandra to
get the reference data needed?

Also, when I join two RDDs, will it result in a scan of the RDDs, or will Spark
hash-partition the two RDDs so that records with the same key end up on the
same node?
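
To make the second question concrete, here is a rough sketch of what I mean by
hash partitioning both sides before joining. It is plain Python, not Spark, and
the function names (hash_partition, partitioned_join) and the sample data are
made up for illustration only. It is not a claim about how Spark implements a
join.

```python
from collections import defaultdict

def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by hash(key)."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def partitioned_join(left, right, num_partitions=4):
    """Inner join that only compares pairs that landed in the same bucket."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    joined = []
    for lpart, rpart in zip(left_parts, right_parts):
        # Build a lookup table for this bucket's right-hand side only.
        lookup = defaultdict(list)
        for key, value in rpart:
            lookup[key].append(value)
        # Probe the table with this bucket's left-hand side.
        for key, lvalue in lpart:
            for rvalue in lookup.get(key, []):
                joined.append((key, (lvalue, rvalue)))
    return joined

# Hypothetical sample data: (key, payload) pairs.
app_data = [(1, "order-a"), (2, "order-b"), (3, "order-c")]
ref_data = [(1, "customer-x"), (3, "customer-z"), (4, "customer-w")]
result = sorted(partitioned_join(app_data, ref_data))
```

Since equal keys always hash to the same bucket, each bucket can be joined
independently, which is the behaviour I am hoping for across nodes.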

Thanks
Ankur
