For the record, I tried this, and it worked.
On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat <walrusthe...@gmail.com> wrote:

> Oh... so if I had something more reasonable, like RDDs full of tuples of,
> say, (Int, Set, Set), I could expect a more uniform distribution?
>
> Thanks
>
> On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> This happened because they were integers equal to 0 mod 5, and we used
>> the default hashCode implementation for integers, which will map them all
>> to 0. There's no API method that will look at the resulting partition
>> sizes and rebalance them, but you could use another hash function.
>>
>> Matei
>>
>> On Mar 24, 2014, at 5:20 PM, Walrus theCat <walrusthe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0).coalesce(5, true).glom.collect yields
>>>
>>> Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array())
>>>
>>> How do I get something more like:
>>>
>>> Array(Array(0), Array(20), Array(40), Array(60), Array(80))
>>>
>>> Thanks
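
For anyone finding this thread later: Matei's "use another hash function" suggestion can be done with a custom `Partitioner`. The sketch below is one possible approach, not an official API recipe; `SpreadPartitioner` is a hypothetical name, and the `i / 20` mapping assumes the keys are exactly the multiples of 20 from the example.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner that spreads out keys which the default
// hash (key.hashCode % numPartitions) would all send to partition 0.
class SpreadPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = key match {
    // Assumes keys are multiples of 20, as in the example above,
    // so 0, 20, 40, 60, 80 map to partitions 0, 1, 2, 3, 4.
    case i: Int => (i / 20) % parts
    case _      => 0
  }
}

// Usage sketch, assuming an existing SparkContext `sc`:
// partitionBy requires a pair RDD, so key each element by itself first.
//
//   sc.parallelize(Array.tabulate(100)(i => i))
//     .filter(_ % 20 == 0)
//     .map(x => (x, x))
//     .partitionBy(new SpreadPartitioner(5))
//     .values
//     .glom
//     .collect
```

With the default hash partitioner, 0, 20, 40, 60 and 80 are all congruent to 0 mod 5, hence the single non-empty partition in the original output; any `getPartition` that distinguishes them avoids that collision.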