For the record, I tried this, and it worked.
On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat <walrusthe...@gmail.com> wrote:

> Oh... so if I had something more reasonable, like RDDs full of tuples of,
> say, (Int, Set, Set), I could expect a more uniform distribution?
>
> Thanks
>
> On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> This happened because they were integers equal to 0 mod 5, and we used
>> the default hashCode implementation for integers, which will map them all
>> to 0. There's no API method that will look at the resulting partition
>> sizes and rebalance them, but you could use another hash function.
>>
>> Matei
>>
>> On Mar 24, 2014, at 5:20 PM, Walrus theCat <walrusthe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0).coalesce(5, true).glom.collect yields
>>>
>>> Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array())
>>>
>>> How do I get something more like:
>>>
>>> Array(Array(0), Array(20), Array(40), Array(60), Array(80))
>>>
>>> Thanks
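
For anyone finding this thread later: Matei's "use another hash function" suggestion can be done with a custom `Partitioner`. The sketch below is one possible approach, not an official API recipe; `SpreadPartitioner` is a hypothetical name, and the `i / 20` mapping assumes the keys are exactly the multiples of 20 from the example.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner that spreads out keys which the default
// hash (key.hashCode % numPartitions) would all send to partition 0.
class SpreadPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = key match {
    // Assumes keys are multiples of 20, as in the example above,
    // so 0, 20, 40, 60, 80 map to partitions 0, 1, 2, 3, 4.
    case i: Int => (i / 20) % parts
    case _      => 0
  }
}

// Usage sketch, assuming an existing SparkContext `sc`:
// partitionBy requires a pair RDD, so key each element by itself first.
//
//   sc.parallelize(Array.tabulate(100)(i => i))
//     .filter(_ % 20 == 0)
//     .map(x => (x, x))
//     .partitionBy(new SpreadPartitioner(5))
//     .values
//     .glom
//     .collect
```

With the default hash partitioner, 0, 20, 40, 60 and 80 are all congruent to 0 mod 5, hence the single non-empty partition in the original output; any `getPartition` that distinguishes them avoids that collision.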