performance comparison: join vs cogroup?

freedafeng Mon, 06 Oct 2014 17:17:44 -0700

For two large key-value data sets that share the same set of keys, what is
the fastest way to join them into one?  Assume all keys are unique within
each data set, and that we only care about keys that appear in both.


Input data I have: (k, v1) and (k, v2)

Data I want to get from the input: (k, (v1, v2))

I don't mind using cogroup if it's faster, because only minor work is needed
to convert its output into the format I need. Join is more straightforward,
but I think join has to allow for duplicate keys, so it may lose some
performance when the keys are in fact unique (I might be wrong here).
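
To make the comparison concrete, here is a minimal plain-Python sketch of the
semantics I have in mind. It is not Spark code and says nothing about actual
performance; the helper names cogroup and join_via_cogroup and the sample data
are made up for illustration. It assumes cogroup collects, per key, the values
from each side, and that an inner join can be expressed as the per-key
combinations of those two collections.

```python
from collections import defaultdict


def cogroup(left, right):
    """Model of cogroup: key -> (values from left, values from right)."""
    grouped = defaultdict(lambda: ([], []))
    for k, v in left:
        grouped[k][0].append(v)
    for k, w in right:
        grouped[k][1].append(w)
    return dict(grouped)


def join_via_cogroup(left, right):
    """Model of an inner join: every (v, w) combination per shared key.

    Keys missing on either side have an empty list there, so they
    produce no output pairs.
    """
    return [
        (k, (v, w))
        for k, (vs, ws) in cogroup(left, right).items()
        for v in vs
        for w in ws
    ]


# Hypothetical sample data: keys are unique within each input.
left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", "x"), ("b", "y"), ("d", "z")]

joined = sorted(join_via_cogroup(left, right))
print(joined)
```

With unique keys each per-key collection holds at most one value, so under
this model the join step adds only a trivial loop on top of the grouping. If
that also holds in Spark, the difference between the two should be small.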

Thanks.


