Currently Spark provides many excellent algorithms for per-key operations,
as long as the data sent to the reducers per key fits in memory. Operations
like combineByKey, reduceByKey and foldByKey rely on pushing the operation
map-side so that the data on the reduce side stays small, and groupByKey
simply requires that the values per key fit in memory.
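
For example, in the spark-shell (toy data, just to show the contrast):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

  // reduceByKey combines map-side, so only one partial sum per key and
  // partition crosses the shuffle
  val sums = pairs.reduceByKey(_ + _)      // ("a", 4), ("b", 6)

  // groupByKey ships every value across the shuffle and materializes
  // them per key in memory on the reduce side
  val grouped = pairs.groupByKey()         // ("a", [1, 3]), ("b", [2, 4])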

But there are algorithms for which we would like to process all the values
per key on the reduce side, even when they do not fit in memory. Examples
are algorithms that need to process the values in sorted order, or
algorithms that need to emit all the values again. This is basically what
the original Hadoop reduce operation did so well: it allowed the values to
be sorted (using secondary sort), and it processed all values per key in a
streaming fashion.

The spark-sorted library aims to bring these kinds of operations back to
Spark, by providing a way to process values with a user-provided
Ordering[V] and a user-provided streaming operation Iterator[V] =>
Iterator[W]. It does not assume that the values per key fit in memory.
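
To make this concrete, here is what the two user-provided pieces could look
like for a job that computes the time gaps between consecutive events per
user. This is plain Scala and deliberately independent of the library's
actual API; the Event type is just an example:

  case class Event(timestamp: Long, payload: String)

  // order the values for a key by timestamp
  implicit val eventOrdering: Ordering[Event] =
    Ordering.by((e: Event) => e.timestamp)

  // streaming operation: walk the timestamp-ordered events once and emit
  // the gap between each consecutive pair, without ever holding all of a
  // user's events in memory
  val gaps: Iterator[Event] => Iterator[Long] = events =>
    events.sliding(2).withPartial(false).map { case Seq(a, b) =>
      b.timestamp - a.timestamp
    }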

The basic idea is to rely on Spark's sort-based shuffle to re-arrange the
data so that all values for a given key are placed consecutively within a
single partition, and then to process them using a map-like operation.
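
A rough sketch of that mechanism using only the public Spark API follows.
This is not the library's actual implementation: the function name
groupSortAndStream is made up here, and the lazy per-key grouping at the
end is simplified (it assumes the user function consumes its whole input
iterator):

  import scala.reflect.ClassTag
  import org.apache.spark.{HashPartitioner, Partitioner}
  import org.apache.spark.rdd.RDD

  // sketch only: shuffle on a composite (key, value) key, partition by the
  // real key alone, let the sort-based shuffle order records by (key, value)
  // within each partition, then walk every partition once, applying f to
  // each key's consecutive run of values
  def groupSortAndStream[K: Ordering: ClassTag, V: Ordering: ClassTag, W: ClassTag](
      rdd: RDD[(K, V)], numPartitions: Int)(f: Iterator[V] => Iterator[W]): RDD[(K, W)] = {

    val hash = new HashPartitioner(numPartitions)
    // route each record by its real key only, so all values for a key end
    // up in the same partition
    val partitionByKey = new Partitioner {
      def numPartitions: Int = hash.numPartitions
      def getPartition(compositeKey: Any): Int =
        hash.getPartition(compositeKey.asInstanceOf[(K, V)]._1)
    }
    // sort within partitions by (key, value); the sort-based shuffle can
    // spill to disk, so the values per key never need to fit in memory here
    implicit val kvOrdering: Ordering[(K, V)] =
      Ordering.Tuple2(implicitly[Ordering[K]], implicitly[Ordering[V]])

    rdd.map { case (k, v) => ((k, v), ()) }
      .repartitionAndSortWithinPartitions(partitionByKey)
      .mapPartitions { iter =>
        val buffered = iter.buffered
        // one streaming pass per partition; the values for each key appear
        // consecutively and already sorted by the value ordering
        new Iterator[Iterator[(K, W)]] {
          def hasNext: Boolean = buffered.hasNext
          def next(): Iterator[(K, W)] = {
            val key = buffered.head._1._1
            val valuesForKey = new Iterator[V] {
              def hasNext: Boolean = buffered.hasNext && buffered.head._1._1 == key
              def next(): V = buffered.next()._1._2
            }
            // simplification: assumes f fully consumes its input iterator
            f(valuesForKey).map(w => (key, w))
          }
        }.flatMap(identity)
      }
  }

With the implicit eventOrdering from the earlier sketch in scope, this
could be applied to an RDD[(String, Event)] keyed by user id as
groupSortAndStream(eventsByUser, 100)(gaps).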

You can find the project here:
https://github.com/tresata/spark-sorted

The project is in a very early stage. Any feedback is very much appreciated.
