Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-24 Thread Sean Owen
Yeah, reduce() will leave you with one big collection of sets on the driver. Maybe the set of all identifiers isn't so big; even a hundred million Longs isn't so much. I'm glad to hear cartesian works, but can that scale? You're making an RDD of N^2 elements initially, which is just vast. On Thu,

Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Hello, Most of the tasks I've accomplished in Spark were fairly straightforward, but I can't figure out the following problem using the Spark API. Basically, I have an IP with a bunch of user IDs associated with it. I want to create a list of all user IDs that are associated together, even if some are

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Sean Owen
So, given sets, you are joining overlapping sets until all of them are mutually disjoint, right? If graphs are out, then I would also love to see a slick distributed solution, but I couldn't think of one. It seems like a cartesian product won't scale. You can write a simple method to implement
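
To make "joining overlapping sets until all of them are mutually disjoint" concrete, here is a minimal single-machine sketch in plain Python. It is not the method from the message above (that text is cut off) and it does not use the Spark API; it swaps in a union-find structure to show the expected result on a small input. The function name merge_overlapping and the sample IPs and user IDs are hypothetical.

```python
from collections import defaultdict


def merge_overlapping(groups):
    """Merge sets that share any element until all results are disjoint.

    Hypothetical helper for illustration; runs on one machine, not on an RDD.
    """
    parent = {}

    def find(x):
        # Register unseen elements as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for group in groups:
        members = list(group)
        for m in members:
            find(m)  # make sure singletons are registered too
        for other in members[1:]:
            union(members[0], other)

    # Collect elements under their final root.
    merged = defaultdict(set)
    for x in list(parent):
        merged[find(x)].add(x)
    return list(merged.values())


# Made-up sample data: IP -> set of user IDs seen on that IP.
ip_to_users = {
    "10.0.0.1": {1, 2},
    "10.0.0.2": {2, 3},
    "10.0.0.3": {4},
    "10.0.0.4": {5, 4},
}
result = merge_overlapping(ip_to_users.values())
print(sorted(sorted(s) for s in result))
```

Users 1, 2 and 3 end up in one group because user 2 links the first two IPs, and users 4 and 5 form the second group. This only illustrates the target output; distributing it is the open question in the thread.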

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Ah yes, you're quite right: with partitions I could probably process a good chunk of the data, but I didn't think a reduce would work. Sorry, I'm still new to Spark and MapReduce in general, but I thought that the reduce result wasn't an RDD and had to fit into memory. If the result of a reduce can

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
For what it's worth, I got it to work with a Cartesian product. Even though it's very inefficient, it worked out alright for me. The trick was to flat map it (step 4) after the Cartesian product so that I could do a reduce by key and find the commonalities. After I was done, I could check if any Value