@deenar - I like the custom partitioner strategy that you mentioned. I
think it's very useful.

As a thought exercise, is it possible to re-partition your RDD to
distribute the long-running tasks more evenly among the short-running
tasks by ordering the keys differently?  This would play nicely with the
existing RangePartitioner.
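
For instance, something like this (just a sketch, not tested -- the
interleaveByWeight name and the weight function are made up; the idea is to
rank elements by weight and re-key them round-robin so that sortByKey, which
uses a RangePartitioner under the hood, mixes heavy and light elements in
each range):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: re-key an RDD so that sortByKey (i.e. a RangePartitioner on the
// new keys) interleaves heavy and light elements across partitions.
// `weight` is a hypothetical user-supplied cost estimate per element.
def interleaveByWeight[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)],
    numPartitions: Int,
    weight: ((K, V)) => Double): RDD[(K, V)] = {
  rdd
    .sortBy(weight, ascending = false)          // heaviest elements first
    .zipWithIndex()                             // global rank by weight
    .map { case (kv, rank) => ((rank % numPartitions, rank), kv) }
    .sortByKey(numPartitions = numPartitions)   // ranges now mix heavy and light
    .values
}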

Or perhaps manipulate the keys' hashCode() to distribute the tasks more
evenly and play nicely with the existing HashPartitioner?
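
Again, just a sketch (bucketFor is a made-up function that would have to be
precomputed from the weights, e.g. heavy keys get their own buckets and
light keys are spread round-robin):

import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Sketch: wrap each key so hashCode() returns a precomputed weight bucket,
// letting the stock HashPartitioner (hashCode % numPartitions) spread the
// heavy keys instead of leaving them wherever their natural hashCode lands.
// (bucket must be a pure function of key so equals/hashCode stay consistent)
case class WeightedKey[K](key: K, bucket: Int) {
  override def hashCode(): Int = bucket
}

def rebalanceByHash[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)],
    numPartitions: Int,
    bucketFor: K => Int): RDD[(WeightedKey[K], V)] =
  rdd
    .map { case (k, v) => (WeightedKey(k, bucketFor(k)), v) }
    .partitionBy(new HashPartitioner(numPartitions))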

I don't know whether either of these is beneficial, but I'm throwing them
out for the sake of conversation...
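
And for what it's worth, here's roughly how I'd picture the
EquallyWeightedPartitioner that Deenar sketches below (illustration only --
the bucket boundaries and the bucket-to-partition lists are assumed to be
computed up front from a sample, per steps 1-3 in the quoted message, and
weightFunction is the user's weight function):

import scala.util.Random
import org.apache.spark.Partitioner

// Illustration of steps 1-4 from the quoted message. The histogram
// boundaries and the bucket -> partitions mapping are assumed to be
// precomputed on the driver (e.g. from a sample of element weights).
class EquallyWeightedPartitioner(
    partitions: Int,
    weightFunction: Any => Double,
    bucketUpperBounds: Array[Double],       // sorted upper bound of each weight bucket
    partitionsPerBucket: Array[Seq[Int]])   // partitions assigned to each bucket
  extends Partitioner {

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val w = weightFunction(key)
    // 4a) find the bucket for this key's weight
    val i = bucketUpperBounds.indexWhere(w <= _)
    val bucket = if (i >= 0) i else bucketUpperBounds.length - 1
    // 4b) pick a random partition from the bucket's list
    // (note: a non-deterministic getPartition can misbehave if tasks are
    // recomputed, so a real version might hash the key instead)
    val candidates = partitionsPerBucket(bucket)
    candidates(Random.nextInt(candidates.size))
  }
}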

-chris


On Fri, May 2, 2014 at 11:10 AM, Andrew Ash <and...@andrewash.com> wrote:

> Deenar,
>
> I haven't heard of any activity to do partitioning in that way, but it
> does seem more broadly valuable.
>
>
> On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar
> <deenar.toras...@db.com> wrote:
>
>> I have equal sized partitions now, but I want the RDD to be partitioned
>> such
>> that the partitions are equally weighted by some attribute of each RDD
>> element (e.g. size or complexity).
>>
>> I have been looking at the RangePartitioner code and I have come up with
>> something like
>>
>> EquallyWeightedPartitioner(noOfPartitions, weightFunction)
>>
>> 1) take a sum (or a sample) of the complexities of all elements and
>> calculate the average weight per partition
>> 2) take a histogram of weights
>> 3) assign a list of partitions to each bucket
>> 4) getPartition(key: Any): Int would
>>   a) get the weight and then find the bucket
>>   b) assign a random partition from the list of partitions associated with
>> each bucket
>>
>> Just wanted to know if someone else had come across this issue before and
>> whether there was a better way of doing this.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Equally-weighted-partitions-in-Spark-tp5171p5212.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
