i added it

On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz <brk...@gmail.com> wrote:
> Hi Koert,
>
> Would you like to register this on spark-packages.org?
>
> Burak
>
> On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> currently spark provides many excellent algorithms for operations per key,
>> as long as the data sent to the reducers per key fits in memory. operations
>> like combineByKey, reduceByKey and foldByKey rely on pushing the operation
>> map-side so that the data reduce-side is small. and groupByKey simply
>> requires that the values per key fit in memory.
>>
>> but there are algorithms for which we would like to process all the values
>> per key reduce-side, even when they do not fit in memory. examples are
>> algorithms that need to process the values in order, or algorithms that
>> need to emit all values again. basically this is what the original hadoop
>> reduce operation did so well: it allowed sorting of values (using secondary
>> sort), and it processed all values per key in a streaming fashion.
>>
>> the library spark-sorted aims to bring these kinds of operations back to
>> spark, by providing a way to process values with a user-provided
>> Ordering[V] and a user-provided streaming operation Iterator[V] =>
>> Iterator[W]. it does not assume that the values per key fit in memory.
>>
>> the basic idea is to rely on spark's sort-based shuffle to re-arrange the
>> data so that all values for a given key are placed consecutively within a
>> single partition, and then process them using a map-like operation.
>>
>> you can find the project here:
>> https://github.com/tresata/spark-sorted
>>
>> the project is in a very early stage. any feedback is very much
>> appreciated.
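
for anyone curious how this could work under the hood, here is a rough sketch in plain spark. to be clear, this is not the spark-sorted api itself; groupSort, mapStreamByKey and KeyPartitioner are just illustrative names. the trick is to shuffle on the whole (key, value) pair while partitioning on the key alone, so the sort-based shuffle delivers each key's values consecutively and in order, hadoop-style secondary sort:

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, Partitioner}

// route each (key, value) pair to a partition based on the key alone,
// even though the shuffle key is the full pair
class KeyPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) => hash.getPartition(k)
  }
}

// shuffle so that, within each partition, pairs are sorted by key and then
// by value: all values for a key end up consecutive and ordered
def groupSort[K: Ordering, V: Ordering](rdd: RDD[(K, V)], partitions: Int): RDD[(K, V)] = {
  // sort by key first, then by value (the secondary sort)
  implicit val kvOrdering: Ordering[(K, V)] = Ordering.Tuple2(Ordering[K], Ordering[V])
  rdd
    .map(kv => (kv, ())) // key the shuffle by the full pair
    .repartitionAndSortWithinPartitions(new KeyPartitioner(partitions))
    .map(_._1)
}

// apply a streaming Iterator[V] => Iterator[W] per key, without ever
// materializing all values for a key; assumes f fully consumes its input
def mapStreamByKey[K, V, W](sorted: RDD[(K, V)])(f: Iterator[V] => Iterator[W]): RDD[(K, W)] =
  sorted.mapPartitions { iter =>
    val buf = iter.buffered
    new Iterator[(K, W)] {
      private var current: Iterator[(K, W)] = Iterator.empty
      def hasNext: Boolean = current.hasNext || buf.hasNext
      def next(): (K, W) = {
        if (!current.hasNext) {
          val key = buf.head._1
          // lazy view over this key's consecutive values in the sorted stream
          val values = new Iterator[V] {
            def hasNext: Boolean = buf.hasNext && buf.head._1 == key
            def next(): V = buf.next()._2
          }
          current = f(values).map(w => (key, w))
        }
        current.next()
      }
    }
  }

used like this, for example a per-key sum over pairs: RDD[(String, Int)] that only ever holds the running total in memory:

val sums = mapStreamByKey(groupSort(pairs, 10)) { values => Iterator(values.sum) }

the nice property is the one the original hadoop reducer had: the values iterator is backed directly by the sorted shuffle output, so nothing per key has to be held in memory. note the sketch assumes f fully consumes its input iterator for each key; a real implementation would also have to drain any leftover values before moving on to the next key.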