Thanks for the clarification. What is the proper way to configure RDDs when your aggregate data size exceeds your available working memory? In particular, in addition to typical operations, I'm performing cogroups, joins, and coalesces/shuffles.
I see that the default storage level for RDDs is MEMORY_ONLY. Do I just need to set the storage level for all of my RDDs to something like MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior in the presence of coalesces/shuffles, cogroups, and joins? Thanks, Allen
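
For concreteness, here is a minimal sketch of the kind of setup I mean. The input paths, key extraction, and join are hypothetical placeholders; the point is the persist(StorageLevel.MEMORY_AND_DISK) calls on the RDDs feeding the join:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[*]", "spill-example")

    // Persist each side of the join with a storage level that can
    // spill partitions to disk when they don't fit in memory,
    // rather than evicting and recomputing them.
    val left = sc.textFile("hdfs:///data/left")
      .map(line => (line.split(",")(0), line))
      .persist(StorageLevel.MEMORY_AND_DISK)

    val right = sc.textFile("hdfs:///data/right")
      .map(line => (line.split(",")(0), line))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // The joined result can be persisted the same way before any
    // further shuffles or coalesces.
    val joined = left.join(right).persist(StorageLevel.MEMORY_AND_DISK)
    joined.count()

Is this the right approach, or is more configuration needed for the shuffle stages themselves?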