Yeah, reduce() will leave you with one big collection of sets on the
driver. Maybe the set of all identifiers isn't so big -- even a hundred
million Longs isn't that much. I'm glad to hear cartesian works, but
can that scale? You're making an RDD of N^2 elements initially, which
is just vast.
Hello,
Most of the tasks I've accomplished in Spark were fairly
straightforward, but I can't figure out the following problem using the
Spark API:
Basically, I have an IP with a bunch of user IDs associated with it. I
want to create a list of all user IDs that are associated together,
even if some are only associated indirectly.
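To make the goal concrete, here is a minimal single-machine sketch with
made-up IPs and user IDs, using a union-find structure. It only
illustrates the grouping being asked for, not a Spark solution:

```python
from collections import defaultdict

# Hypothetical data: each IP maps to the user IDs seen on it.
ip_to_users = {
    "10.0.0.1": ["u1", "u2"],
    "10.0.0.2": ["u2", "u3"],  # u3 never shares an IP with u1; they link via u2
    "10.0.0.3": ["u4"],
}

parent = {}

def find(x):
    """Union-find root lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for users in ip_to_users.values():
    for u in users:
        find(u)             # register every user, even singletons
    for u in users[1:]:
        union(users[0], u)  # users sharing an IP join one group

groups = defaultdict(set)
for u in parent:
    groups[find(u)].add(u)

print(sorted(sorted(g) for g in groups.values()))
# -> [['u1', 'u2', 'u3'], ['u4']]
```

u1 and u3 end up in the same group even though they never appear on the
same IP, which is exactly the transitive association described above.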
So, given sets, you are joining overlapping sets until all of them are
mutually disjoint, right?
If graphs are out, then I also would love to see a slick distributed
solution, but I couldn't think of one. It seems like a cartesian
product won't scale.
You can write a simple method to implement that merging, at least.
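For example, such a merge method might look like this in plain Python
(a sketch only; `merge_until_disjoint` is my own name, not anything
from the Spark API):

```python
def merge_until_disjoint(sets):
    """Fold sets together, merging any that overlap, until the remaining
    groups are mutually disjoint (connected components of the overlap graph)."""
    groups = []
    for s in sets:
        s = set(s)
        overlapping = [g for g in groups if g & s]  # groups sharing an element
        for g in overlapping:
            s |= g
            groups.remove(g)
        groups.append(s)
    return groups

print(sorted(sorted(g) for g in merge_until_disjoint(
    [{1, 2}, {2, 3}, {4, 5}, {5, 6}, {7}])))
# -> [[1, 2, 3], [4, 5, 6], [7]]
```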
Ah yes, you're quite right: with partitions I could probably process a
good chunk of the data, but I didn't think a reduce would work. Sorry,
I'm still new to Spark and MapReduce in general, but I thought that the
reduce result wasn't an RDD and had to fit into memory. If the result
of a reduce can fit into memory, would that approach work?
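A local sketch of what such a reduce would compute, using Python's
functools.reduce with a set-merging combine function over made-up
"partitions". On Spark, rdd.reduce would pull the final combined
collection back to the driver:

```python
from functools import reduce

def combine(a, b):
    """Associative merge of two collections of disjoint sets: any sets
    that overlap across the two collections are unioned together."""
    out = [set(g) for g in a]
    for s in b:
        s = set(s)
        overlapping = [g for g in out if g & s]
        for g in overlapping:
            s |= g
            out.remove(g)
        out.append(s)
    return out

# Pretend each inner list is one partition's locally merged sets.
partitions = [
    [{1, 2}, {4}],
    [{2, 3}],
    [{5, 6}],
]

result = reduce(combine, partitions)
print(sorted(sorted(g) for g in result))
# -> [[1, 2, 3], [4], [5, 6]]
```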
For what it's worth, I got it to work with a Cartesian product. Even
though it's very inefficient, it worked out alright for me. The trick
was to flat map it (step4) after the cartesian product so that I could
do a reduce by key and find the commonalities. After I was done, I
could check whether any values still overlapped.
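A rough local sketch of how I read that pipeline, with plain Python
standing in for cartesian / flatMap / reduceByKey and made-up data.
The original code isn't shown here, so this is only my reconstruction:

```python
from itertools import product
from collections import defaultdict

# Made-up records: (ip, user IDs seen on that ip)
records = [
    ("ip1", frozenset({"u1", "u2"})),
    ("ip2", frozenset({"u2", "u3"})),
    ("ip3", frozenset({"u4"})),
]

# "cartesian": every pair of records -- the N^2 blow-up discussed above
pairs = product(records, repeat=2)

# "flatMap": for each pair whose user sets overlap, emit (user, merged set)
keyed = []
for (_, a), (_, b) in pairs:
    if a & b:
        merged = a | b
        for u in merged:
            keyed.append((u, merged))

# "reduceByKey": union every set emitted under the same user key
by_user = defaultdict(set)
for u, s in keyed:
    by_user[u] |= s

print({u: sorted(s) for u, s in sorted(by_user.items())})
# -> {'u1': ['u1', 'u2', 'u3'], 'u2': ['u1', 'u2', 'u3'],
#     'u3': ['u1', 'u2', 'u3'], 'u4': ['u4']}
```

With longer chains of indirection a single pass like this isn't enough;
you would check whether any values still overlap and repeat.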