Hello,

To avoid computing all possible pairs across the whole dataset, I'm trying to group values by key and then compute the cartesian product of the values within each key, i.e.:
Input: [(k1, [v1]), (k1, [v2]), (k2, [v3])]

Desired output: [(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]

Currently I'm doing it as follows (product is from Python's itertools):

    import itertools

    input = sc.textFile('data.csv')
    # wrap each record in a singleton list, keyed by its key field
    rdd = input.map(lambda x: (x.key, [x]))
    # concatenate the lists so each key collects all of its records
    rdd2 = rdd.reduceByKey(lambda x, y: x + y)
    # per key, pair every record with every record of the same key
    rdd3 = rdd2.flatMapValues(lambda x: itertools.product(x, x))
    # drop the keys, keeping only the pairs
    result = rdd3.map(lambda x: x[1])

This works fine for very small files, but when a key's value list reaches ~1000 elements the computation freezes completely.

Thanks in advance!
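P.S. In case it helps to reproduce the problem, here is a minimal self-contained sketch of the same approach. The local SparkContext, the toy in-memory records, and the (key, value) layout are only placeholders standing in for my real data.csv and its key field:

    import itertools
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "per-key-cartesian")

    # toy records standing in for data.csv: (key, value) pairs
    records = [("k1", "v1"), ("k1", "v2"), ("k2", "v3")]
    rdd = sc.parallelize(records)

    # wrap each value in a singleton list so reduceByKey can concatenate
    pairs = rdd.map(lambda kv: (kv[0], [kv[1]]))
    grouped = pairs.reduceByKey(lambda a, b: a + b)

    # per key, pair the value list with itself
    products = grouped.flatMapValues(lambda vs: itertools.product(vs, vs))
    result = products.map(lambda kv: kv[1])

    # prints the per-key pairs, e.g.
    # [('v1', 'v1'), ('v1', 'v2'), ('v2', 'v1'), ('v2', 'v2'), ('v3', 'v3')]
    print(result.collect())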