Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hi Xinh,

Thanks! Custom partitioner with partitionBy() did the job.


On Tue, May 10, 2016 at 11:36 PM, Xinh Huynh  wrote:

> Hi Ayman,
>
> Have you looked at this:
> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
>
> It recommends defining a custom partitioner and (PairRDD) partitionBy
> method to accomplish this.
>
> Xinh
>
> On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil 
> wrote:
>
>> And btw, I'm using the Python API if this makes any difference.
>>
>> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
>> wrote:
>>
>>> Hi Don,
>>>
>>> This didn't help. My original rdd is already created using 10
>>> partitions. As a matter of fact, after trying with rdd.coalesce(10,
>>> shuffle = true) out of curiosity, the rdd partitions became even more
>>> imbalanced:
>>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096),
>>> (6, 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>>
>>>
>>> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>>>
 You can call rdd.coalesce(10, shuffle = true) and the returning rdd
 will be evenly balanced.  This obviously triggers a shuffle, so be advised
 it could be an expensive operation depending on your RDD size.

 -Don

 On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
 wrote:

> Hello,
>
> I have 50,000 items parallelized into an RDD with 10 partitions, I
> would like to evenly split the items over the partitions so:
> 50,000/10 = 5,000 in each RDD partition.
>
> What I get instead is the following (partition index, partition count):
> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
> 5120), (7, 5120), (8, 5120), (9, 4944)]
>
> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the
> partitions are imbalanced.
>
> Is there a way to do that?
>
> Thank you,
> Ayman
>



 --
 Donald Drake
 Drake Consulting
 http://www.drakeconsulting.com/
 https://twitter.com/dondrake 
 800-733-2143

>>>
>>>
>>
>


Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
Well, for Python, it should be rdd.coalesce(10, shuffle=True)

I have had good success with this using the Scala API in Spark 1.6.1.

-Don

On Tue, May 10, 2016 at 3:15 PM, Ayman Khalil  wrote:

> And btw, I'm using the Python API if this makes any difference.
>
> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
> wrote:
>
>> Hi Don,
>>
>> This didn't help. My original rdd is already created using 10 partitions.
>> As a matter of fact, after trying with rdd.coalesce(10, shuffle =
>> true) out of curiosity, the rdd partitions became even more imbalanced:
>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
>> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>
>>
>> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>>
>>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>>> be evenly balanced.  This obviously triggers a shuffle, so be advised it
>>> could be an expensive operation depending on your RDD size.
>>>
>>> -Don
>>>
>>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
>>> wrote:
>>>
 Hello,

 I have 50,000 items parallelized into an RDD with 10 partitions, I
 would like to evenly split the items over the partitions so:
 50,000/10 = 5,000 in each RDD partition.

 What I get instead is the following (partition index, partition count):
 [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
 5120), (7, 5120), (8, 5120), (9, 4944)]

 the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
 are imbalanced.

 Is there a way to do that?

 Thank you,
 Ayman

>>>
>>>
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake 
>>> 800-733-2143
>>>
>>
>>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Xinh Huynh
Hi Ayman,

Have you looked at this:
http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where

It recommends defining a custom partitioner and (PairRDD) partitionBy
method to accomplish this.

Xinh

On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil  wrote:

> And btw, I'm using the Python API if this makes any difference.
>
> On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
> wrote:
>
>> Hi Don,
>>
>> This didn't help. My original rdd is already created using 10 partitions.
>> As a matter of fact, after trying with rdd.coalesce(10, shuffle =
>> true) out of curiosity, the rdd partitions became even more imbalanced:
>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
>> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>
>>
>> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>>
>>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>>> be evenly balanced.  This obviously triggers a shuffle, so be advised it
>>> could be an expensive operation depending on your RDD size.
>>>
>>> -Don
>>>
>>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
>>> wrote:
>>>
 Hello,

 I have 50,000 items parallelized into an RDD with 10 partitions, I
 would like to evenly split the items over the partitions so:
 50,000/10 = 5,000 in each RDD partition.

 What I get instead is the following (partition index, partition count):
 [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
 5120), (7, 5120), (8, 5120), (9, 4944)]

 the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
 are imbalanced.

 Is there a way to do that?

 Thank you,
 Ayman

>>>
>>>
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake 
>>> 800-733-2143
>>>
>>
>>
>


Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
And btw, I'm using the Python API if this makes any difference.

On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil 
wrote:

> Hi Don,
>
> This didn't help. My original rdd is already created using 10 partitions.
> As a matter of fact, after trying with rdd.coalesce(10, shuffle =
> true) out of curiosity, the rdd partitions became even more imbalanced:
> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>
>
> On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:
>
>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>> be evenly balanced.  This obviously triggers a shuffle, so be advised it
>> could be an expensive operation depending on your RDD size.
>>
>> -Don
>>
>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
>> wrote:
>>
>>> Hello,
>>>
>>> I have 50,000 items parallelized into an RDD with 10 partitions, I would
>>> like to evenly split the items over the partitions so:
>>> 50,000/10 = 5,000 in each RDD partition.
>>>
>>> What I get instead is the following (partition index, partition count):
>>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
>>> 5120), (7, 5120), (8, 5120), (9, 4944)]
>>>
>>> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
>>> are imbalanced.
>>>
>>> Is there a way to do that?
>>>
>>> Thank you,
>>> Ayman
>>>
>>
>>
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake 
>> 800-733-2143
>>
>
>


Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hi Don,

This didn't help. My original rdd is already created using 10 partitions.
As a matter of fact, after trying with rdd.coalesce(10, shuffle = true) out
of curiosity, the rdd partitions became even more imbalanced:
[(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
5120), (7, 5120), (8, 5120), (9, *6144*)]


On Tue, May 10, 2016 at 10:16 PM, Don Drake  wrote:

> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
> be evenly balanced.  This obviously triggers a shuffle, so be advised it
> could be an expensive operation depending on your RDD size.
>
> -Don
>
> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
> wrote:
>
>> Hello,
>>
>> I have 50,000 items parallelized into an RDD with 10 partitions, I would
>> like to evenly split the items over the partitions so:
>> 50,000/10 = 5,000 in each RDD partition.
>>
>> What I get instead is the following (partition index, partition count):
>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
>> 5120), (7, 5120), (8, 5120), (9, 4944)]
>>
>> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
>> are imbalanced.
>>
>> Is there a way to do that?
>>
>> Thank you,
>> Ayman
>>
>
>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake 
> 800-733-2143
>


Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
You can call rdd.coalesce(10, shuffle = true) and the returning rdd will be
evenly balanced.  This obviously triggers a shuffle, so be advised it could
be an expensive operation depending on your RDD size.

-Don

On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil 
wrote:

> Hello,
>
> I have 50,000 items parallelized into an RDD with 10 partitions, I would
> like to evenly split the items over the partitions so:
> 50,000/10 = 5,000 in each RDD partition.
>
> What I get instead is the following (partition index, partition count):
> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
> 5120), (7, 5120), (8, 5120), (9, 4944)]
>
> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
> are imbalanced.
>
> Is there a way to do that?
>
> Thank you,
> Ayman
>



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hello,

I have 50,000 items parallelized into an RDD with 10 partitions, I would
like to evenly split the items over the partitions so:
50,000/10 = 5,000 in each RDD partition.

What I get instead is the following (partition index, partition count):
[(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120), (6,
5120), (7, 5120), (8, 5120), (9, 4944)]

the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions are
imbalanced.

Is there a way to do that?

Thank you,
Ayman