I will try using per partition sorted data. Can I also use groupBy and join per partition? Basically I want to restrict the computation per partition like using this data.mapPartitions(_.toList.sortBy(...).toIterator). Is there a more direct way to create a RDD that does partition wise operations?
On Sat, Oct 26, 2013 at 3:50 AM, Aaron Davidson <[email protected]> wrote: > Currently, our sortByKey should be using Java's native Timsort > implementation, which is an adaptive sort. That should also mean sorting is > very fast for almost-sorted data. The overhead you're seeing might be > caused by reshuffling everything during the range partitioning step *before > *the sort, which has to serialize all your data. > > Nathan's solution might then work out nicely for you, as it will avoid > shuffling the data. > > > On Fri, Oct 25, 2013 at 9:18 AM, Josh Rosen <[email protected]> wrote: > >> Adaptive sorting algorithms (https://en.wikipedia.org/wiki/Adaptive_sort) >> can benefit from presortedness in their inputs, so that might be a >> helpful search term for researching this problem. >> >> >> On Fri, Oct 25, 2013 at 7:23 AM, Nathan Kronenfeld < >> [email protected]> wrote: >> >>> I suspect from his description the difference is negligible for his >>> case. However, there are ways around that anyway. >>> >>> Assuming a fixed data set (as opposed to something like a streaming >>> example, where there is no last element), one can take 3 passes to: >>> >>> 1. get the last element of each partition >>> 2. take elements from each partition that fall before the last >>> element of the previous partition, separate them from the rest of their >>> partition >>> 3. and add them to the previous (whichever previous is appropriate, >>> in really degenerate cases, which it sounds like he doesn't have) in the >>> right location >>> >>> >>> >>> >>> On Fri, Oct 25, 2013 at 10:17 AM, Sebastian Schelter <[email protected]>wrote: >>> >>>> Using a local sort per partition only gives a correct result if the data >>>> is already range partitioned. >>>> >>>> On 25.10.2013 16:11, Nathan Kronenfeld wrote: >>>> > Since no one else has answered... >>>> > I assume: >>>> > >>>> > data.mapPartitions(_.toList.sortBy(...).toIterator) >>>> > >>>> > would work, but I also suspect there's a better way. >>>> > >>>> > >>>> > On Fri, Oct 25, 2013 at 5:01 AM, Arun Kumar <[email protected]> >>>> wrote: >>>> > >>>> >> Hi, >>>> >> >>>> >> I am trying to process some logs and the data is sorted(*almost*) by >>>> >> timestamp. >>>> >> If I do a full sort it takes a lot of time. Is there some way to >>>> sort more >>>> >> efficiently (like restricting sort to per partition). >>>> >> >>>> >> Thanks in advance >>>> >> >>>> > >>>> > >>>> > >>>> >>>> >>> >>> >>> -- >>> Nathan Kronenfeld >>> Senior Visualization Developer >>> Oculus Info Inc >>> 2 Berkeley Street, Suite 600, >>> Toronto, Ontario M5A 4J5 >>> Phone: +1-416-203-3003 x 238 >>> Email: [email protected] >>> >> >> >
