Hello,

To avoid computing the cartesian product of the entire dataset, I want to
group values by key and then compute the cartesian product of the values
within each key, e.g.:

Input:
 [(k1, [v1]), (k1, [v2]), (k2, [v3])]

Desired output:
[(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]
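
To pin down the semantics, here is the same transformation in plain Python
(ignoring Spark for a moment); the variable names are just for illustration:

import itertools
from collections import defaultdict

data = [('k1', ['v1']), ('k1', ['v2']), ('k2', ['v3'])]
groups = defaultdict(list)
for key, values in data:
    groups[key].extend(values)          # merge the value lists per key
pairs = [p for vs in groups.values()
         for p in itertools.product(vs, vs)]
# pairs == [('v1', 'v1'), ('v1', 'v2'), ('v2', 'v1'), ('v2', 'v2'),
#           ('v3', 'v3')]  (pair order within a key follows itertools.product)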

Currently I'm doing it roughly as follows (product is from Python's
itertools):

import itertools

lines = sc.textFile('data.csv')
# Extract the grouping key from each line; splitting on the first comma
# here is just a stand-in for however the key is actually parsed.
rdd = lines.map(lambda line: (line.split(',')[0], [line]))
rdd2 = rdd.reduceByKey(lambda x, y: x + y)        # one list of values per key
rdd3 = rdd2.flatMapValues(lambda vs: itertools.product(vs, vs))
result = rdd3.map(lambda kv: kv[1])               # drop the key, keep the pairs
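
For concreteness, here is the same pipeline run on the toy input above (a
minimal local-mode sketch; the SparkContext setup is just for illustration):

import itertools
from pyspark import SparkContext

sc = SparkContext('local', 'cartesian-by-key')
pairs = sc.parallelize([('k1', ['v1']), ('k1', ['v2']), ('k2', ['v3'])])
grouped = pairs.reduceByKey(lambda x, y: x + y)
expanded = grouped.flatMapValues(lambda vs: itertools.product(vs, vs))
print(expanded.map(lambda kv: kv[1]).collect())
# [('v1', 'v1'), ('v1', 'v2'), ('v2', 'v1'), ('v2', 'v2'), ('v3', 'v3')]
# (ordering across keys/partitions may vary)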

This works fine for very small files, but once a single key's value list
reaches a length of ~1000, the computation freezes completely (presumably
because n values expand to n^2 pairs, so ~1000 values per key already means
~1,000,000 tuples materialized for that key).

Thanks in advance!



