I took a stab at it and wrote a partitioner
<https://github.com/syedhashmi/spark/commit/4ca94cc155aea4be36505d5f37d037e209078196>
that I intend to contribute back to the main repo sometime later. The
partitioner takes a parameter that governs the minimum number of keys per
partition; once all partitions hit that limit, it goes round robin. An
alternate strategy could be to go round robin by default. This partitioner
guarantees equally sized partitions without tinkering with hash codes,
complex balancing computations, etc.
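
Roughly, the idea looks like this (a simplified sketch, not the exact code
in the commit linked above; the class and parameter names here are just
illustrative):

import org.apache.spark.Partitioner

// Sketch: fill every partition up to a minimum number of keys first,
// then fall back to round robin. The counts are per-instance state, so
// this only balances keys routed through the same partitioner instance.
class MinKeysPartitioner(partitions: Int, minKeysPerPartition: Int)
  extends Partitioner {

  private val counts = new Array[Long](partitions)
  private var nextRoundRobin = 0

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val underfilled = counts.indexWhere(_ < minKeysPerPartition)
    val target =
      if (underfilled >= 0) underfilled   // fill up to the minimum first
      else {                              // then go round robin
        val p = nextRoundRobin
        nextRoundRobin = (nextRoundRobin + 1) % partitions
        p
      }
    counts(target) += 1
    target
  }
}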

I wanted to get your thoughts on this, along with any critical comments or
suggestions for improvement.

Thanks,
Syed.


On Sat, May 3, 2014 at 6:12 PM, Chris Fregly <ch...@fregly.com> wrote:

> @deenar-  i like the custom partitioner strategy that you mentioned.  i
> think it's very useful.
>
> as a thought-exercise, is it possible to re-partition your RDD to
> more-evenly distribute the long-running tasks among the short-running tasks
> by ordering the keys differently?  this would play nice with the existing
> RangePartitioner.
>
> or perhaps manipulate the key's hashCode() to more-evenly distribute the
> tasks so they play nicely with the existing HashPartitioner?
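>
> something concrete along those lines (totally untested, just to
> illustrate -- this is the usual key "salting" trick rather than
> overriding hashCode() itself; rdd and the salt range of 4 are made up):
>
> import org.apache.spark.HashPartitioner
> import org.apache.spark.SparkContext._
> import scala.util.Random
>
> // tag each key with a small random salt so a hot key spreads across a
> // few partitions under the plain HashPartitioner (values for the same
> // original key then need to be re-combined afterwards)
> val salted = rdd.map { case (k, v) => ((k, Random.nextInt(4)), v) }
> val spread = salted.partitionBy(new HashPartitioner(rdd.partitions.length))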
>
> i don't know if either of these are beneficial, but throwing them out for
> the sake of conversation...
>
> -chris
>
>
> On Fri, May 2, 2014 at 11:10 AM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Deenar,
>>
>> I haven't heard of any activity to do partitioning in that way, but it
>> does seem more broadly valuable.
>>
>>
>> On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar 
>> <deenar.toras...@db.com> wrote:
>>
>>> I have equal sized partitions now, but I want the RDD to be partitioned
>>> such
>>> that the partitions are equally weighted by some attribute of each RDD
>>> element (e.g. size or complexity).
>>>
>>> I have been looking at the RangePartitioner code and I have come up with
>>> something like
>>>
>>> EquallyWeightedPartitioner(noOfPartitions, weightFunction)
>>>
>>> 1) take a sum (or sample) of the complexities of all elements and
>>> calculate the average weight per partition
>>> 2) take a histogram of the weights
>>> 3) assign a list of partitions to each bucket
>>> 4) getPartition(key: Any): Int would
>>>   a) get the weight and then find the bucket
>>>   b) assign a random partition from the list of partitions associated
>>> with that bucket
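>>>
>>> Something along these lines (rough sketch only, untested; the bucket
>>> boundaries and the bucket-to-partition assignment from steps 1-3 are
>>> assumed to be precomputed and passed in):
>>>
>>> import scala.util.Random
>>> import org.apache.spark.Partitioner
>>>
>>> class EquallyWeightedPartitioner(
>>>     noOfPartitions: Int,
>>>     weightFunction: Any => Double,
>>>     bucketUpperBounds: Array[Double],      // histogram bucket bounds
>>>     bucketToPartitions: Array[Seq[Int]])   // step 3: partitions per bucket
>>>   extends Partitioner {
>>>
>>>   override def numPartitions: Int = noOfPartitions
>>>
>>>   override def getPartition(key: Any): Int = {
>>>     // 4a) get the weight and then find the bucket
>>>     val weight = weightFunction(key)
>>>     val bucket = bucketUpperBounds.indexWhere(weight <= _) match {
>>>       case -1 => bucketUpperBounds.length - 1   // beyond the last bound
>>>       case b  => b
>>>     }
>>>     // 4b) pick a random partition from that bucket's partition list
>>>     val candidates = bucketToPartitions(bucket)
>>>     candidates(Random.nextInt(candidates.size))
>>>   }
>>> }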
>>>
>>> Just wanted to know if someone else had come across this issue before and
>>> whether there is a better way of doing this.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Equally-weighted-partitions-in-Spark-tp5171p5212.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>