Hello,

Most of the tasks I've accomplished in Spark have been fairly straightforward, but I can't figure out how to solve the following problem using the Spark API:
Basically, I have IPs, each with a bunch of user IDs associated to it. I want to create a list of all user IDs that are associated together, even if some of them appear on different IPs. For example:

• IP: 1.24.22.10 / User IDs: A, B, C
• IP: 2.24.30.11 / User IDs: C, D, E
• IP: 3.21.30.11 / User IDs: F, Z, E
• IP: 4.21.30.11 / User IDs: T, S, R

The end result would be two lists: [A, B, C, D, E, F, Z] and [T, S, R].

What I've tried, starting from:

rdd = sc.parallelize([frozenset([1, 2]), frozenset([2, 3]), frozenset([3, 4])])

- Cartesian / Filter: remove pairs of items with no user ID in common.
- Map: merge the two user ID sets into one common set.
- Distinct: remove duplicates.

I would have to run this a couple of times, but it doesn't quite work: for example, [1, 2] keeps getting merged into larger sets on every pass, yet never goes away itself, so the RDD never converges (see the passes below). I assume there's a common pattern to do this in MapReduce, but I just don't know it :\. I realize it's a graph problem (connected components), but Spark's graph implementation is not available in Python yet.

Pass 1:
SET: frozenset([1, 2, 3])
SET: frozenset([2, 3, 4])
SET: frozenset([2, 3])
SET: frozenset([1, 2])
SET: frozenset([3, 4])

Pass 2:
SET: frozenset([1, 2, 3, 4])
SET: frozenset([1, 2, 3])
SET: frozenset([2, 3, 4])
SET: frozenset([2, 3])
SET: frozenset([1, 2])
SET: frozenset([3, 4])

Pass 3:
SET: frozenset([1, 2, 3, 4])
SET: frozenset([1, 2, 3])
SET: frozenset([2, 3, 4])
SET: frozenset([2, 3])
SET: frozenset([1, 2])
SET: frozenset([3, 4])

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
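To make the behaviour I'm after concrete, here is the merge-to-a-fixed-point idea in plain Python, ignoring Spark for a moment. This is only an illustrative sketch (the function name merge_overlapping is my own): the key difference from my passes above is that a smaller set is *removed* when it is absorbed into a bigger one, so the collection actually converges instead of keeping [1, 2] around forever.

```python
def merge_overlapping(sets):
    """Repeatedly merge any two sets sharing an element until stable.

    Absorbed sets are removed, so the result contains only the final
    connected components, with no leftover subsets.
    """
    groups = [set(s) for s in sets]
    changed = True
    while changed:
        changed = False
        merged = []
        while groups:
            g = groups.pop()
            # Find every already-kept group that overlaps g ...
            overlap = [h for h in merged if g & h]
            for h in overlap:
                # ... and fold it into g instead of keeping both copies.
                merged.remove(h)
                g |= h
                changed = True
            merged.append(g)
        groups = merged
    return [frozenset(g) for g in groups]


# The example from the post: E links the third IP to the first two,
# so two components come out: {A, B, C, D, E, F, Z} and {T, S, R}.
components = merge_overlapping([
    {"A", "B", "C"},
    {"C", "D", "E"},
    {"F", "Z", "E"},
    {"T", "S", "R"},
])
```

In Spark terms I imagine the same "drop absorbed subsets, loop until the count stops changing" step would have to follow each cartesian / filter / map / distinct pass, but I don't know the idiomatic way to express that.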