Hi Josh, I just ran into this again myself and noticed that the source hasn't changed since we discussed in December. Should I file an official bug in Jira?
Andrew On Tue, Dec 10, 2013 at 8:34 AM, Josh Rosen <[email protected]> wrote: > I wonder whether making RangePartitoner .rangeBounds into a lazy val would > fix this ( > https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). > We'd need to make sure that rangeBounds() is never called before an action > is performed. This could be tricky because it's called in the > RangePartitioner.equals() method. Maybe it's sufficient to just compare > the number of partitions, the ids of the RDDs used to create the > RangePartitioner, and the sort ordering. This still supports the case > where I range-partition one RDD and pass the same partitioner to a > different RDD. It breaks support for the case where two range partitioners > created on different RDDs happened to have the same rangeBounds(), but it > seems unlikely that this would really harm performance since it's probably > unlikely that the range partitioners are equal by chance. > > > On Tue, Dec 10, 2013 at 8:18 AM, Ryan Prenger <[email protected]>wrote: > >> Thanks for the responses! I agree that b seems like it would be better. >> I could imagine optimizations that could be made if a filter call came >> after the sortByKey that would make the initial partitioning sub-optimal. >> Plus this way, it's a pain to use in the REPL. >> >> Cheers, >> >> Ryan >> >> >> On Tue, Dec 10, 2013 at 7:06 AM, Andrew Ash <[email protected]> wrote: >> >>> Since sortByKey() invokes those right now, we should either a) change >>> the documentation to treat note that it kicks off actions or b) change the >>> method to execute those things lazily. >>> >>> Personally I'd prefer b but don't know how difficult that would be. >>> >>> >>> On Tue, Dec 10, 2013 at 1:52 AM, Jason Lenderman >>> <[email protected]>wrote: >>> >>>> Hey Ryan, >>>> >>>> The *sortByKey* method creates a *RangePartitioner* (see >>>> Partitioner.scala), and the initialization code of the >>>> *RangePartitioner* invokes actions *count* and *sample*. >>>> >>>> >>>> Jason >>>> >>>> >>>> >>>> >>>> On Mon, Dec 9, 2013 at 7:01 PM, Ryan Prenger <[email protected]>wrote: >>>> >>>>> sortByKey is listed as a data transformation, not an action, yet it >>>>> launches a job. This doesn't seem to square with the documentation. >>>>> >>>>> Ryan >>>>> >>>> >>>> >>> >> >
