Hello,

Most of the tasks I've accomplished in Spark were fairly straightforward, but
I can't figure out the following problem using the Spark API:

Basically, I have an IP with a bunch of user IDs associated with it. I want to
create a list of all user IDs that are associated together, even if some
appear on different IPs.

For example: 

•       IP: 1.24.22.10 / User ID: A, B, C
•       IP: 2.24.30.11 / User ID: C, D, E
•       IP: 3.21.30.11 / User ID: F, Z, E
•       IP: 4.21.30.11 / User ID: T, S, R

The end result would be two lists: [A, B, C, D, E, F, Z] and [T, S, R].

What I've tried is:

        rdd = sc.parallelize([frozenset([1, 2]), frozenset([2, 3]),
                              frozenset([3, 4])])

- Cartesian / Filter: pair every set with every other, remove pairs with no
  user ID in common.
- Map: merge the two user ID sets of each pair into a common set.
- Distinct: remove duplicates.
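To make those steps concrete, here's a plain-Python simulation of one pass
(itertools.product stands in for Spark's cartesian; variable names are mine),
which reproduces the Pass 1 output shown below:

```python
from itertools import product

sets = [frozenset([1, 2]), frozenset([2, 3]), frozenset([3, 4])]

# Cartesian: every ordered pair of sets (including a set paired with itself)
pairs = product(sets, sets)
# Filter: keep only pairs that share at least one user ID
overlapping = [(a, b) for a, b in pairs if a & b]
# Map: union each surviving pair into one set
merged = [a | b for a, b in overlapping]
# Distinct: drop duplicates
result = set(merged)
```

Note that because every set overlaps with itself, the original small sets
always survive each pass, which is exactly the non-convergence problem
described below.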

I would have to run it a couple of times, but it doesn't quite work: for
example, [1, 2] keeps getting merged with [1, 2], so the smaller sets never
disappear and the output never converges (see below). I assume there's a
common pattern for doing this in MapReduce, but I just don't know it. I
realize it's a graph problem (connected components), but Spark's graph
implementation is not available in Python yet.
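Not a Spark solution, but to make the target output concrete, here is a local
plain-Python sketch using union-find (function and variable names are mine)
that produces the grouping described above:

```python
def connected_groups(id_sets):
    """Merge sets that share any member into connected groups (union-find)."""
    parent = {}

    def find(x):
        # follow parent pointers to the root, compressing the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for ids in map(list, id_sets):
        for uid in ids:
            parent.setdefault(uid, uid)
        # link every ID in this set to the first one
        for other in ids[1:]:
            union(ids[0], other)

    # collect every ID under its root
    groups = {}
    for uid in parent:
        groups.setdefault(find(uid), set()).add(uid)
    return list(groups.values())

# the IP example from above
ip_sets = [{"A", "B", "C"}, {"C", "D", "E"}, {"F", "Z", "E"}, {"T", "S", "R"}]
groups = connected_groups(ip_sets)
```

This yields the two groups {A, B, C, D, E, F, Z} and {T, S, R}; the open
question is how to express the same fixpoint computation with RDD operations.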

Pass 1:
        SET: frozenset([1, 2, 3])
        SET: frozenset([2, 3, 4])
        SET: frozenset([2, 3])
        SET: frozenset([1, 2])
        SET: frozenset([3, 4])

Pass 2:
        SET: frozenset([1, 2, 3, 4])
        SET: frozenset([1, 2, 3])
        SET: frozenset([2, 3, 4])
        SET: frozenset([2, 3])
        SET: frozenset([1, 2])
        SET: frozenset([3, 4])

Pass 3:
        SET: frozenset([1, 2, 3, 4])
        SET: frozenset([1, 2, 3])
        SET: frozenset([2, 3, 4])
        SET: frozenset([2, 3])
        SET: frozenset([1, 2])
        SET: frozenset([3, 4])




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
