And btw, I'm using the Python API if this makes any difference.

On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil <aymkhali...@gmail.com> wrote:
> Hi Don,
>
> This didn't help. My original rdd is already created using 10 partitions.
> As a matter of fact, after trying with rdd.coalesce(10, shuffle = true)
> out of curiosity, the rdd partitions became even more imbalanced:
> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096),
> (6, 5120), (7, 5120), (8, 5120), (9, *6144*)]
>
> On Tue, May 10, 2016 at 10:16 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> You can call rdd.coalesce(10, shuffle = true) and the returning rdd will
>> be evenly balanced. This obviously triggers a shuffle, so be advised it
>> could be an expensive operation depending on your RDD size.
>>
>> -Don
>>
>> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil <aymkhali...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have 50,000 items parallelized into an RDD with 10 partitions, I would
>>> like to evenly split the items over the partitions so:
>>> 50,000/10 = 5,000 in each RDD partition.
>>>
>>> What I get instead is the following (partition index, partition count):
>>> [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4, 5120), (5, 5120),
>>> (6, 5120), (7, 5120), (8, 5120), (9, 4944)]
>>>
>>> the total is correct (4096 + 4944 + 8*5120 = 50,000) but the partitions
>>> are imbalanced.
>>>
>>> Is there a way to do that?
>>>
>>> Thank you,
>>> Ayman
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake
>> 800-733-2143
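For reference, the counts reported above are mostly multiples of 1024 (4096 = 4 * 1024, 5120 = 5 * 1024), which suggests the elements are being distributed in fixed-size batches rather than one at a time. One possible way to get exactly 5,000 items per partition from the Python API is to key every element by its position and partition on that key. The sketch below is only an illustration under that assumption, not a confirmed fix from this thread: the names index_partitioner, rebalance, partition_sizes and main are hypothetical helpers, while zipWithIndex, partitionBy, values and glom are existing PySpark RDD methods. Like coalesce with a shuffle, it moves all the data, so it may be expensive on a large RDD.

```python
NUM_PARTITIONS = 10  # partition count used in the thread


def index_partitioner(index):
    """Map an element's position to a partition id, round-robin."""
    return index % NUM_PARTITIONS


def rebalance(rdd):
    """Return an RDD with the same elements spread evenly by position."""
    return (
        rdd.zipWithIndex()                     # (item, index)
        .map(lambda pair: (pair[1], pair[0]))  # (index, item)
        .partitionBy(NUM_PARTITIONS, index_partitioner)
        .values()                              # drop the index again
    )


def partition_sizes(rdd):
    """Return the number of elements in each partition, in partition order."""
    return rdd.glom().map(len).collect()


def main():
    # Requires a working PySpark installation; imported here so that the
    # helpers above can be loaded without one.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rebalance-sketch")
    try:
        rdd = sc.parallelize(range(50000), NUM_PARTITIONS)
        print(partition_sizes(rdd))             # sizes before rebalancing
        print(partition_sizes(rebalance(rdd)))  # sizes after rebalancing
    finally:
        sc.stop()
```

Calling main() prints the partition sizes before and after. Since positions 0 to 49,999 are assigned modulo 10, each partition id receives the same number of positions. Note that the order of elements within and across partitions is not preserved by this approach.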