For two large key-value data sets, if they have the same set of keys, what is the fastest way to join them into one? Suppose all keys are unique in each data set, and we only care about those keys that appear in both data sets.
Input data I have: (k, v1) and (k, v2). Data I want to get from the input: (k, (v1, v2)).

I don't mind using cogroup if it's faster, since only minor work is needed to convert its output into the format I need. Join is more straightforward, but I think join has to allow for keys that are not unique, so there could be some performance loss there (I might be wrong here).

Thanks.
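To make the two options concrete, here is a small plain-Python sketch of the semantics I have in mind. It is not Spark code and says nothing about performance; it only models what each operation produces on a toy input. The function names (`cogroup`, `join_via_cogroup`, `unique_key_join`) and the sample data are made up for illustration.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group both inputs by key: key -> (values from left, values from right)."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)


def join_via_cogroup(left, right):
    """General inner join: every (v1, v2) combination for each shared key."""
    return [
        (k, (v1, v2))
        for k, (vs1, vs2) in cogroup(left, right).items()
        for v1 in vs1
        for v2 in vs2
    ]


def unique_key_join(left, right):
    """Assumes keys are unique per input, so each side has at most one value."""
    return [
        (k, (vs1[0], vs2[0]))
        for k, (vs1, vs2) in cogroup(left, right).items()
        if vs1 and vs2  # keep only keys present in both inputs
    ]


# Hypothetical sample data: (k, v1) pairs and (k, v2) pairs.
left = [("a", 1), ("b", 2), ("c", 3)]
right = [("b", 20), ("c", 30), ("d", 40)]

joined = sorted(join_via_cogroup(left, right))
converted = sorted(unique_key_join(left, right))
```

With unique keys, both functions should give the same (k, (v1, v2)) pairs for the keys present in both inputs; the only difference in this model is that the general join iterates over all value combinations, while the unique-key version just takes the single value from each side.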