Okay. Can't I supply the same partitioner I used for "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
On 14-Jul-2016 11:38 PM, "Koert Kuipers" <ko...@tresata.com> wrote:

> repartitionAndSortWithinPartitions partitions the rdd and sorts within
> each partition. so each partition is fully sorted, but the rdd is not
> sorted.
>
> sortByKey is basically the same as repartitionAndSortWithinPartitions
> except it uses a range partitioner so that the entire rdd is sorted.
> however since sortByKey uses a different partitioner than
> repartitionAndSortWithinPartitions you do not get much benefit from running
> sortByKey after repartitionAndSortWithinPartitions (because all the data
> will get shuffled again)
>
> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.puni...@gmail.com>
> wrote:
>
>> Hi Koert
>>
>> I have already used "repartitionAndSortWithinPartitions" for secondary
>> sorting and it works fine. Just wanted to know whether it will sort the
>> entire RDD or not.
>>
>> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> repartitionAndSortWithinPartit sort by keys, not values per key, so not
>>> really secondary sort by itself.
>>>
>>> for secondary sort also check out:
>>> https://github.com/tresata/spark-sorted
>>>
>>> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.puni...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys
>>>>
>>>> In my spark/scala code I am implementing secondary sort. I wanted to
>>>> know, when I call the "repartitionAndSortWithinPartitions" method, the
>>>> whole (entire) RDD will be sorted or only the individual partitions will be
>>>> sorted?
>>>> If its the latter case, will applying a "sortByKey" after
>>>> "repartitionAndSortWithinPartitions" be faster now that the individual
>>>> partitions are sorted?
>>>>
>>>> --
>>>> Thank You
>>>>
>>>> Regards
>>>>
>>>> Punit Naik
>>>>
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
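
For reference, here is a minimal sketch of the distinction described in the quoted reply. It is an illustration, not code from this thread: the object name, the sample data, the partition count, and the local master setting are all assumptions. It calls "repartitionAndSortWithinPartitions" twice, once with a HashPartitioner (each partition sorted, the RDD as a whole not) and once with a RangePartitioner (partitions cover ordered key ranges, so reading them in order should give a totally sorted RDD).

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}

// Hypothetical driver object; names and data are for illustration only.
object PartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed sample data: (key, value) pairs in no particular order.
      val pairs = sc.parallelize(
        Seq(5 -> "e", 1 -> "a", 4 -> "d", 2 -> "b", 6 -> "f", 3 -> "c"))

      // Hash partitioner: keys are sorted inside each partition, but the
      // partitions hold interleaved keys, so the RDD is not globally sorted.
      val hashed = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

      // Range partitioner: each partition holds a contiguous key range and is
      // sorted internally, so the partitions in order form a total sort.
      val ranged = pairs.repartitionAndSortWithinPartitions(new RangePartitioner(2, pairs))

      // glom() turns each partition into an array so its contents can be printed.
      hashed.glom().collect().foreach(p => println(p.mkString("hash partition:  ", ", ", "")))
      ranged.glom().collect().foreach(p => println(p.mkString("range partition: ", ", ", "")))
    } finally {
      sc.stop()
    }
  }
}
```

As far as I can tell, the Scala "sortByKey" accepts only an ascending flag and a partition count, not a partitioner, so passing a RangePartitioner to "repartitionAndSortWithinPartitions" as above looks like the closest way to get a total sort with a partitioner of your choosing.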