Re: ReduceByKey and sorting within partitions

Burak Yavuz Mon, 04 May 2015 13:07:37 -0700

I think this Spark Package may be what you're looking for!
http://spark-packages.org/package/tresata/spark-sorted


Best,
Burak

On Mon, May 4, 2015 at 12:56 PM, Imran Rashid <[email protected]> wrote:

> oh wow, that is a really interesting observation, Marco & Jerry.
> I wonder if this is worth exposing in combineByKey()?  I think Jerry's
> proposed workaround is all you can do for now -- use reflection to
> side-step the fact that the methods you need are private.
>
> On Mon, Apr 27, 2015 at 8:07 AM, Saisai Shao <[email protected]>
> wrote:
>
>> Hi Marco,
>>
>> As far as I know, the current combineByKey() does not expose an argument
>> for setting keyOrdering on the ShuffledRDD. Since ShuffledRDD is
>> package private, you would have to get at it through reflection or some
>> other way; the keyOrdering you set is then pushed down to the shuffle. If
>> you use a combination of transformations instead, the result will be the
>> same but the efficiency may differ: some transformations are split into
>> separate stages, which introduces an additional shuffle.
>>
>> Thanks
>> Jerry
>>
>>
>> 2015-04-27 19:00 GMT+08:00 Marco <[email protected]>:
>>
>>> Hi,
>>>
>>> After reducing by key, I'm trying to get data ordered among partitions
>>> (like RangePartitioner) and within partitions (like sortByKey or
>>> repartitionAndSortWithinPartitions), pushing the sorting down to the
>>> shuffle machinery of the reducing phase.
>>>
>>> I think, but maybe I'm wrong, that the correct way to do this is for
>>> combineByKey to call the setKeyOrdering function on the ShuffledRDD that
>>> it returns.
>>>
>>> Am I wrong? Can it be done with a combination of other transformations
>>> with the same efficiency?
>>>
>>> Thanks,
>>> Marco
>>>
>>
>
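
For readers following the thread, the result Marco asks for (values reduced by key, keys ordered across partitions, and keys sorted inside each partition) can be modelled in plain Python. This is only a sketch of the semantics, not Spark code and not how the shuffle is implemented: reduce_by_key, range_partition_sorted and the bounds list are hypothetical names made up for the illustration.

```python
from bisect import bisect_left


def reduce_by_key(pairs, func):
    """Merge all values that share a key using func (models a reduce by key)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def range_partition_sorted(merged, bounds):
    """Assign each key to a range partition, then sort each partition by key.

    bounds is a hypothetical sorted list of upper bounds: partition i holds
    keys <= bounds[i]; the last partition holds keys above the final bound.
    """
    partitions = [[] for _ in range(len(bounds) + 1)]
    for key, value in merged.items():
        partitions[bisect_left(bounds, key)].append((key, value))
    # Sorting inside each partition is the step the thread wants pushed
    # down into the shuffle instead of run as a separate pass.
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


pairs = [("d", 1), ("a", 2), ("c", 3), ("a", 4), ("b", 5), ("d", 6)]
result = range_partition_sorted(reduce_by_key(pairs, lambda x, y: x + y), ["b"])
print(result)
```

With the single bound "b", keys "a" and "b" land in the first partition and "c" and "d" in the second, each partition sorted by key. In this model the reduce and the sort are two separate passes over the data, which corresponds to the extra stage Jerry mentions for the combination-of-transformations approach.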
