Re: almost sorted data

Nathan Kronenfeld Mon, 28 Oct 2013 07:45:04 -0700

I'm not sure what you're asking.

At some level, all RDD operations are partition-wise: each task only
operates on one partition at a time.


I suspect what you're looking for is something where you could just write:

data.mapPartitions(_.sortBy(...))

If that's what you want, then no - but only because Iterator has no sortBy
method. I'm not sure why mapPartitions hands one an iterator rather than a
list. Presumably it's so one can avoid holding the whole partition in
memory at once - but equally presumably, one already has the whole
partition in memory at once, so that seems odd to me. Anyone know why?
Perhaps to allow for worst-case scenarios?
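
For what it's worth, here is a minimal sketch of the workaround, written against plain Scala collections so it runs without a Spark cluster. The helper name sortPartition and the sample (timestamp, message) pairs are made up for illustration and are not part of Spark. Because an Iterator can only be traversed once, the helper copies the partition into memory before sorting, so it assumes each partition fits in memory.

```scala
// Hypothetical helper (not part of Spark): sorts the contents of one partition.
// Iterator has no sortBy, so the elements are copied into an in-memory Vector,
// sorted there, and handed back as a fresh Iterator.
def sortPartition[T, K: Ordering](it: Iterator[T])(key: T => K): Iterator[T] =
  it.toVector.sortBy(key).iterator

// Stand-in for an RDD's partitions: made-up (timestamp, message) pairs.
val partitions: Seq[Seq[(Long, String)]] = Seq(
  Seq((3L, "c"), (1L, "a"), (2L, "b")),
  Seq((5L, "e"), (4L, "d"))
)

// Sort each partition independently; nothing moves between partitions.
// With a real RDD the equivalent call would be:
//   data.mapPartitions(it => sortPartition(it)(_._1))
val sorted: Seq[List[(Long, String)]] =
  partitions.map(p => sortPartition(p.iterator)(_._1).toList)
```

The result is only ordered within each partition; whether it is also ordered across partitions depends on how the data was split in the first place.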

             -Nathan



On Mon, Oct 28, 2013 at 4:54 AM, Arun Kumar <[email protected]> wrote:

> I will try using per-partition sorted data. Can I also use groupBy and
> join per partition? Basically I want to restrict the computation to each
> partition, as in data.mapPartitions(_.toList.sortBy(...).toIterator).
> Is there a more direct way to create an RDD that does partition-wise
> operations?
>
>
> On Sat, Oct 26, 2013 at 3:50 AM, Aaron Davidson <[email protected]> wrote:
>
>> Currently, our sortByKey should be using Java's native Timsort
>> implementation, which is an adaptive sort. That should also mean sorting is
>> very fast for almost-sorted data. The overhead you're seeing might be
>> caused by reshuffling everything during the range partitioning step
>> *before* the sort, which has to serialize all your data.
>>
>> Nathan's solution might then work out nicely for you, as it will avoid
>> shuffling the data.
>>
>>
>> On Fri, Oct 25, 2013 at 9:18 AM, Josh Rosen <[email protected]> wrote:
>>
>>> Adaptive sorting algorithms (https://en.wikipedia.org/wiki/Adaptive_sort)
>>> can benefit from presortedness in their inputs, so that might be a
>>> helpful search term for researching this problem.
>>>
>>>
>>> On Fri, Oct 25, 2013 at 7:23 AM, Nathan Kronenfeld <
>>> [email protected]> wrote:
>>>
>>>> I suspect from his description the difference is negligible for his
>>>> case.  However, there are ways around that anyway.
>>>>
>>>> Assuming a fixed data set (as opposed to something like a streaming
>>>> example, where there is no last element), one can take 3 passes to:
>>>>
>>>>    1. get the last element of each partition
>>>>    2. take the elements of each partition that fall before the last
>>>>    element of the previous partition, and separate them from the rest of
>>>>    their partition
>>>>    3. add them to the previous partition in the right location (or to
>>>>    whichever earlier partition is appropriate in really degenerate cases,
>>>>    which it sounds like he doesn't have)
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Oct 25, 2013 at 10:17 AM, Sebastian Schelter
>>>> <[email protected]> wrote:
>>>>
>>>>> Using a local sort per partition only gives a correct result if the
>>>>> data is already range partitioned.
>>>>>
>>>>> On 25.10.2013 16:11, Nathan Kronenfeld wrote:
>>>>> > Since no one else has answered...
>>>>> > I assume:
>>>>> >
>>>>> >     data.mapPartitions(_.toList.sortBy(...).toIterator)
>>>>> >
>>>>> > would work, but I also suspect there's a better way.
>>>>> >
>>>>> >
>>>>> > On Fri, Oct 25, 2013 at 5:01 AM, Arun Kumar <[email protected]> wrote:
>>>>> >
>>>>> >> Hi,
>>>>> >>
>>>>> >> I am trying to process some logs and the data is (*almost*) sorted by
>>>>> >> timestamp.
>>>>> >> If I do a full sort it takes a lot of time. Is there some way to sort
>>>>> >> more efficiently (like restricting the sort to each partition)?
>>>>> >>
>>>>> >> Thanks in advance
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nathan Kronenfeld
>>>> Senior Visualization Developer
>>>> Oculus Info Inc
>>>> 2 Berkeley Street, Suite 600,
>>>> Toronto, Ontario M5A 4J5
>>>> Phone:  +1-416-203-3003 x 238
>>>> Email:  [email protected]
>>>>
>>>
>>>
>>
>


