currently spark provides many excellent algorithms for operations per key as long as the data send to the reducers per key fits in memory. operations like combineByKey, reduceByKey and foldByKey rely on pushing the operation map-side so that the data reduce-side is small. and groupByKey simply requires that the values per key fit in memory.
but there are algorithms for which we would like to process all the values per key reduce-side, even when they do not fit in memory. examples are algorithms that need to process the values ordered, or algorithms that need to emit all values again. basically this is what the original hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. the library spark-sorted aims to bring these kind of operations back to spark, by providing a way to process values with a user provided Ordering[V] and a user provided streaming operation Iterator[V] => Iterator[W]. it does not make the assumption that the values need to fit in memory per key. the basic idea is to rely on spark's sort-based shuffle to re-arrange the data so that all values for a given key are placed consecutively within a single partition, and then process them using a map-like operation. you can find the project here: https://github.com/tresata/spark-sorted the project is in a very early stage. any feedback is very much appreciated.