i added it

On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz <brk...@gmail.com> wrote:
> Hi Koert,
>
> Would you like to register this on spark-packages.org?
>
> Burak
>
> On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> currently spark provides many excellent algorithms for operations per key,
>> as long as the data sent to the reducers per key fits in memory. operations
>> like combineByKey, reduceByKey and foldByKey rely on pushing the operation
>> map-side so that the data reduce-side is small. and groupByKey simply
>> requires that the values per key fit in memory.
>>
>> but there are algorithms for which we would like to process all the values
>> per key reduce-side, even when they do not fit in memory. examples are
>> algorithms that need to process the values in order, or algorithms that
>> need to emit all values again. basically this is what the original hadoop
>> reduce operation did so well: it allowed sorting of values (using secondary
>> sort), and it processed all values per key in a streaming fashion.
>>
>> the library spark-sorted aims to bring these kinds of operations back to
>> spark, by providing a way to process values with a user-provided
>> Ordering[V] and a user-provided streaming operation Iterator[V] =>
>> Iterator[W]. it does not assume that the values per key fit in memory.
>>
>> the basic idea is to rely on spark's sort-based shuffle to re-arrange the
>> data so that all values for a given key are placed consecutively within a
>> single partition, and then process them using a map-like operation.
>>
>> you can find the project here:
>> https://github.com/tresata/spark-sorted
>>
>> the project is in a very early stage. any feedback is very much
>> appreciated.
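
for anyone curious how this could work under the hood, here is a rough sketch in plain spark. to be clear, this is not the spark-sorted api itself; groupSort, mapStreamByKey and KeyPartitioner are just illustrative names. the trick is to shuffle on the whole (key, value) pair while partitioning on the key alone, so the sort-based shuffle delivers each key's values consecutively and in order, hadoop-style secondary sort:

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, Partitioner}

// route each (key, value) pair to a partition based on the key alone,
// even though the shuffle key is the full pair
class KeyPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) => hash.getPartition(k)
  }
}

// shuffle so that, within each partition, pairs are sorted by key and then
// by value: all values for a key end up consecutive and ordered
def groupSort[K: Ordering, V: Ordering](rdd: RDD[(K, V)], partitions: Int): RDD[(K, V)] = {
  // sort by key first, then by value (the secondary sort)
  implicit val kvOrdering: Ordering[(K, V)] = Ordering.Tuple2(Ordering[K], Ordering[V])
  rdd
    .map(kv => (kv, ())) // key the shuffle by the full pair
    .repartitionAndSortWithinPartitions(new KeyPartitioner(partitions))
    .map(_._1)
}

// apply a streaming Iterator[V] => Iterator[W] per key, without ever
// materializing all values for a key; assumes f fully consumes its input
def mapStreamByKey[K, V, W](sorted: RDD[(K, V)])(f: Iterator[V] => Iterator[W]): RDD[(K, W)] =
  sorted.mapPartitions { iter =>
    val buf = iter.buffered
    new Iterator[(K, W)] {
      private var current: Iterator[(K, W)] = Iterator.empty
      def hasNext: Boolean = current.hasNext || buf.hasNext
      def next(): (K, W) = {
        if (!current.hasNext) {
          val key = buf.head._1
          // lazy view over this key's consecutive values in the sorted stream
          val values = new Iterator[V] {
            def hasNext: Boolean = buf.hasNext && buf.head._1 == key
            def next(): V = buf.next()._2
          }
          current = f(values).map(w => (key, w))
        }
        current.next()
      }
    }
  }

used like this, for example a per-key sum over pairs: RDD[(String, Int)] that only ever holds the running total in memory:

val sums = mapStreamByKey(groupSort(pairs, 10)) { values => Iterator(values.sum) }

the nice property is the one the original hadoop reducer had: the values iterator is backed directly by the sorted shuffle output, so nothing per key has to be held in memory. note the sketch assumes f fully consumes its input iterator for each key; a real implementation would also have to drain any leftover values before moving on to the next key.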