Hi, in our use case of groupByKey(...): RDD[(K, Iterable[V])], there might be a case where, even for a single key (an extreme case, admittedly), the associated Iterable[V] could result in an OOM.
Would it be possible to provide a 'groupByKeyWithRDD' variant, where the values for each key are exposed as an RDD[V] rather than an Iterable[V]? Ideally, the internal implementation of that RDD[V] would be smart enough to spill the data to disk only beyond a configured threshold, so that we don't sacrifice performance in the normal cases. Any suggestions/comments are welcome. Thanks a lot!

Just a side note: we do understand the points mentioned here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html. However, 'reduceByKey' and 'foldByKey' don't quite fit our needs right now; that is to say, we can't really avoid 'groupByKey'.

-- ChuChao
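P.S. To make the request concrete, here is a rough driver-side sketch of the kind of thing we mean. Everything here is an assumption, not existing Spark API: since RDDs cannot be nested inside other RDDs, this sketch returns a driver-side Map from each key to a lazily filtered RDD of that key's values, which is only practical when the number of distinct keys is small:

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch, NOT an existing Spark API.
// Approximates "groupByKey with per-key RDD values" by filtering the
// source RDD once per distinct key. Each RDD[V] is lazy, so a huge
// value set for one key never has to fit in a single executor's memory.
// Caveat: collect() of the keys and the per-key filter passes assume a
// small number of distinct keys; this is a workaround, not a real impl.
def groupByKeyWithRDD[K: ClassTag, V: ClassTag](
    pairs: RDD[(K, V)]): Map[K, RDD[V]] = {
  val keys = pairs.keys.distinct().collect() // assumes few distinct keys
  keys.map(k => k -> pairs.filter { case (key, _) => key == k }.values)
      .toMap
}
```

Caching `pairs` before calling this would avoid recomputing the source once per key. A built-in version with threshold-based spilling, as requested above, would of course be much better than this per-key filtering.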