This happened because they were integers equal to 0 mod 5, and we used the default hashCode implementation for integers, which will map them all to 0. There’s no API method that will look at the resulting partition sizes and rebalance them, but you could use another hash function.
Matei On Mar 24, 2014, at 5:20 PM, Walrus theCat <walrusthe...@gmail.com> wrote: > Hi, > > sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0 > ).coalesce(5,true).glom.collect yields > > Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), > Array(), Array()) > > How do I get something more like: > > Array(Array(0), Array(20), Array(40), Array(60), Array(80)) > > Thanks