Re: Why does sortByKey launch cluster job?

Josh Rosen Tue, 10 Dec 2013 08:35:12 -0800

I wonder whether making RangePartitoner .rangeBounds into a lazy val would
fix this (
https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
 We'd need to make sure that rangeBounds() is never called before an action
is performed.  This could be tricky because it's called in the
RangePartitioner.equals() method.  Maybe it's sufficient to just compare
the number of partitions, the ids of the RDDs used to create the
RangePartitioner, and the sort ordering.  This still supports the case
where I range-partition one RDD and pass the same partitioner to a
different RDD.  It breaks support for the case where two range partitioners
created on different RDDs happened to have the same rangeBounds(), but it
seems unlikely that this would really harm performance since it's probably
unlikely that the range partitioners are equal by chance.



On Tue, Dec 10, 2013 at 8:18 AM, Ryan Prenger <[email protected]> wrote:

> Thanks for the responses!  I agree that b seems like it would be better.
>  I could imagine optimizations that could be made if a filter call came
> after the sortByKey that would make the initial partitioning sub-optimal.
>  Plus this way, it's a pain to use in the REPL.
>
> Cheers,
>
> Ryan
>
>
> On Tue, Dec 10, 2013 at 7:06 AM, Andrew Ash <[email protected]> wrote:
>
>> Since sortByKey() invokes those right now, we should either a) change the
>> documentation to treat note that it kicks off actions or b) change the
>> method to execute those things lazily.
>>
>> Personally I'd prefer b but don't know how difficult that would be.
>>
>>
>> On Tue, Dec 10, 2013 at 1:52 AM, Jason Lenderman 
>> <[email protected]>wrote:
>>
>>> Hey Ryan,
>>>
>>> The *sortByKey* method creates a *RangePartitioner* (see
>>> Partitioner.scala), and the initialization code of the
>>> *RangePartitioner* invokes actions *count* and *sample*.
>>>
>>>
>>> Jason
>>>
>>>
>>>
>>>
>>> On Mon, Dec 9, 2013 at 7:01 PM, Ryan Prenger <[email protected]>wrote:
>>>
>>>> sortByKey is listed as a data transformation, not an action, yet it
>>>> launches a job.  This doesn't seem to square with the documentation.
>>>>
>>>> Ryan
>>>>
>>>
>>>
>>
>

Re: Why does sortByKey launch cluster job?

Reply via email to