Yeah, reduce() will leave you with one big collection of sets on the driver. Maybe the set of all identifiers isn't so big -- even a hundred million Longs isn't that much. I'm glad to hear cartesian works, but can it scale? You're making an RDD of N^2 elements up front, which is vast.
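One common way to dodge the N^2 cartesian is to invert the (K, V) records into an identifier -> keys index, so only records that actually share an identifier ever get paired. This is just a sketch of that idea in plain Python (toy data, hypothetical record names, no Spark) -- in Spark it would be a flatMap followed by a groupByKey, not literally this code:

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-ins for the (K, V) records: (record_key, set of identifiers).
records = [
    ("a", {1, 2}),
    ("b", {2, 3}),
    ("c", {4}),
]

# Invert to identifier -> set of record keys, the shape a
# flatMap + groupByKey would produce in Spark.
index = defaultdict(set)
for key, ids in records:
    for i in ids:
        index[i].add(key)

# Candidate pairs: only records sharing at least one identifier,
# instead of all N^2 pairs from a cartesian self-join.
pairs = set()
for keys in index.values():
    for a, b in combinations(sorted(keys), 2):
        pairs.add((a, b))

print(sorted(pairs))  # [('a', 'b')] -- only "a" and "b" share identifier 2
```

The pair count is now bounded by how many records share each identifier, rather than by N^2 over the whole dataset.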
On Thu, Jul 24, 2014 at 2:09 AM, Roch Denis <rde...@exostatic.com> wrote:
> Ah yes, you're quite right with partitions, I could probably process a good
> chunk of the data, but I didn't think a reduce would work? Sorry, I'm still
> new to Spark and map reduce in general, but I thought that the reduce result
> wasn't an RDD and had to fit into memory. If the result of a reduce can be
> any size, then yes, I can see how to make it work.
>
> Sorry for not being certain, the doc is not quite clear on that point, at
> least to me.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530p10556.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.