Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hi Xinh, Thanks! Custom partitioner with partitionBy() did the job. On Tue, May 10, 2016 at 11:36 PM, Xinh Huynh wrote: > Hi Ayman, > > Have you looked at this: >

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
Well, for Python, it should be rdd.coalesce(10, shuffle=True) I have had good success with this using the Scala API in Spark 1.6.1. -Don On Tue, May 10, 2016 at 3:15 PM, Ayman Khalil wrote: > And btw, I'm using the Python API if this makes any difference. > > On Tue,

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Xinh Huynh
Hi Ayman, Have you looked at this: http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where It recommends defining a custom partitioner and (PairRDD) partitionBy method to accomplish this. Xinh On Tue, May 10, 2016 at 1:15 PM,

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
And btw, I'm using the Python API if this makes any difference. On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil wrote: > Hi Don, > > This didn't help. My original rdd is already created using 10 partitions. > As a matter of fact, after trying with rdd.coalesce(10, shuffle

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hi Don, This didn't help. My original rdd is already created using 10 partitions. As a matter of fact, after trying with rdd.coalesce(10, shuffle = true) out of curiosity, the rdd partitions became even more imbalanced: [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
You can call rdd.coalesce(10, shuffle = true) and the returning rdd will be evenly balanced. This obviously triggers a shuffle, so be advised it could be an expensive operation depending on your RDD size. -Don On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil wrote: >

Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hello, I have 50,000 items parallelized into an RDD with 10 partitions, I would like to evenly split the items over the partitions so: 50,000/10 = 5,000 in each RDD partition. What I get instead is the following (partition index, partition count): [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4,