Okay. Can't I supply the same partitioner I used for
"repartitionAndSortWithinPartitions" as an argument to "sortByKey"?

On 14-Jul-2016 11:38 PM, "Koert Kuipers" <ko...@tresata.com> wrote:

> repartitionAndSortWithinPartitions partitions the rdd and sorts within
> each partition. so each partition is fully sorted, but the rdd is not
> sorted.
>
> sortByKey is basically the same as repartitionAndSortWithinPartitions
> except it uses a range partitioner so that the entire rdd is sorted.
> however since sortByKey uses a different partitioner than
> repartitionAndSortWithinPartitions you do not get much benefit from running
> sortByKey after repartitionAndSortWithinPartitions (because all the data
> will get shuffled again)
>
>
> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.puni...@gmail.com>
> wrote:
>
>> Hi Koert
>>
>> I have already used "repartitionAndSortWithinPartitions" for secondary
>> sorting and it works fine. Just wanted to know whether it will sort the
>> entire RDD or not.
>>
>> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> repartitionAndSortWithinPartitions sorts by keys, not values per key, so
>>> it is not really a secondary sort by itself.
>>>
>>> for secondary sort also check out:
>>> https://github.com/tresata/spark-sorted
>>>
>>>
>>> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.puni...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys
>>>>
>>>> In my spark/scala code I am implementing secondary sort. I wanted to
>>>> know: when I call the "repartitionAndSortWithinPartitions" method, will the
>>>> entire RDD be sorted, or only the individual partitions?
>>>> If it's the latter, will applying a "sortByKey" after
>>>> "repartitionAndSortWithinPartitions" be faster now that the individual
>>>> partitions are sorted?
>>>>
>>>> --
>>>> Thank You
>>>>
>>>> Regards
>>>>
>>>> Punit Naik
>>>>
>>>
>>>
>>
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
>
>
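
The distinction drawn in this thread (every partition sorted versus the whole RDD sorted) can be sketched without Spark. The following is a minimal plain-Scala model, not Spark's implementation: `hashPartition`, `rangePartition`, `repartitionAndSort` and `isSorted` are hypothetical helpers invented for illustration, and the data and range bounds are made up. It suggests why the choice of partitioner, rather than the per-partition sort, decides whether reading the partitions in order yields a totally sorted result.

```scala
object PartitionSortSketch {
  // Hypothetical stand-in for a hash-style partitioner: non-negative key modulo partition count.
  def hashPartition(key: Int, numPartitions: Int): Int =
    ((key % numPartitions) + numPartitions) % numPartitions

  // Hypothetical stand-in for a range-style partitioner: index of the first
  // upper bound that is >= key; keys above every bound go to the last partition.
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val i = upperBounds.indexWhere(bound => key <= bound)
    if (i == -1) upperBounds.length else i
  }

  // Models "repartition, then sort within each partition":
  // assign every record to a partition id, then sort each partition by key.
  def repartitionAndSort[V](data: Seq[(Int, V)], numPartitions: Int)(
      part: Int => Int): Vector[Vector[(Int, V)]] = {
    val grouped = data.groupBy { case (k, _) => part(k) }
    Vector.tabulate(numPartitions) { p =>
      grouped.getOrElse(p, Seq.empty[(Int, V)]).sortBy(_._1).toVector
    }
  }

  // True when the keys are in non-decreasing order.
  def isSorted(keys: Seq[Int]): Boolean =
    keys.zip(keys.drop(1)).forall { case (a, b) => a <= b }

  def main(args: Array[String]): Unit = {
    val data = Seq(5 -> "e", 1 -> "a", 4 -> "d", 2 -> "b", 6 -> "f", 3 -> "c")

    // Hash-style partitioning: each partition is sorted, the concatenation is not.
    val hashed = repartitionAndSort(data, 2)(k => hashPartition(k, 2))
    println(hashed.map(_.map(_._1)))

    // Range-style partitioning: partitions cover disjoint, ordered key ranges,
    // so reading them in partition order gives a total order.
    val ranged = repartitionAndSort(data, 2)(k => rangePartition(k, Seq(3)))
    println(ranged.map(_.map(_._1)))
  }
}
```

In Spark itself, the analogous move would presumably be to pass a range-based partitioner to "repartitionAndSortWithinPartitions" instead of running "sortByKey" afterwards; that is an assumption about how the idea might be applied, not something this thread confirms.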
