@deenar- i like the custom partitioner strategy that you mentioned. i think it's very useful.
as a thought exercise, is it possible to re-partition your RDD to more evenly distribute the long-running tasks among the short-running tasks by ordering the keys differently? this would play nicely with the existing RangePartitioner. or perhaps manipulate the keys' hashCode() to distribute the tasks more evenly and play nicely with the existing HashPartitioner? i don't know if either of these is beneficial, but throwing them out for the sake of conversation...

-chris

On Fri, May 2, 2014 at 11:10 AM, Andrew Ash <and...@andrewash.com> wrote:

> Deenar,
>
> I haven't heard of any activity to do partitioning in that way, but it
> does seem more broadly valuable.
>
>
> On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar
> <deenar.toras...@db.com> wrote:
>
>> I have equal-sized partitions now, but I want the RDD to be partitioned
>> such that the partitions are equally weighted by some attribute of each
>> RDD element (e.g. size or complexity).
>>
>> I have been looking at the RangePartitioner code and I have come up with
>> something like
>>
>> EquallyWeightedPartitioner(noOfPartitions, weightFunction)
>>
>> 1) take a sum (or a sample) of the complexities of all elements and
>>    calculate the average weight per partition
>> 2) take a histogram of the weights
>> 3) assign a list of partitions to each bucket
>> 4) getPartition(key: Any): Int would
>>    a) get the weight and then find the bucket
>>    b) assign a random partition from the list of partitions associated
>>       with each bucket
>>
>> Just wanted to know if someone else had come across this issue before and
>> whether there was a better way of doing this.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Equally-weighted-partitions-in-Spark-tp5171p5212.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
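For what it's worth, here is a rough sketch of what the EquallyWeightedPartitioner described above might look like. It assumes steps 1-3 (sampling the weights, building the histogram, and assigning partition ids to each bucket) have already been done on the driver and their results are passed in; the names weightFunction, bucketUpperBounds and bucketToPartitions are mine, not from the original post. The one deviation from step 4b is picking within a bucket via the key's hashCode rather than a random draw, since Spark generally expects a given key to always map to the same partition:

    import org.apache.spark.Partitioner

    // sketch only: bucketUpperBounds are the ascending upper bounds of the
    // weight histogram, and bucketToPartitions lists the partition ids
    // assigned to each bucket (steps 2 and 3 above, computed up front from
    // a sample of the keys)
    class EquallyWeightedPartitioner[K](
        override val numPartitions: Int,
        weightFunction: K => Double,
        bucketUpperBounds: Array[Double],
        bucketToPartitions: Array[Array[Int]]) extends Partitioner {

      override def getPartition(key: Any): Int = {
        val weight = weightFunction(key.asInstanceOf[K])
        // 4a) find the histogram bucket for this key's weight
        val idx = bucketUpperBounds.indexWhere(weight <= _)
        val bucket = if (idx >= 0) idx else bucketUpperBounds.length - 1
        // 4b) spread keys over the partitions assigned to that bucket,
        //     deterministically, using the key's hashCode
        val candidates = bucketToPartitions(bucket)
        val mod = key.hashCode % candidates.length
        candidates(if (mod < 0) mod + candidates.length else mod)
      }
    }

A fuller version would also override equals()/hashCode() on the partitioner itself so Spark can recognise co-partitioned RDDs, and weightFunction needs to be serializable and cheap, since getPartition runs once per record.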