You can use the RDD.coalesce method to adjust the partitioning of a single RDD. Note that you'll need to pass shuffle = true when increasing the number of partitions.
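For instance, a minimal sketch (the RDD names and input path here are made up, not from your job) of keeping many partitions on the base RDD while coalescing the derived RDDs down:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "coalesce-example")

// Hypothetical base RDD: ask for a large number of partitions up front.
val baseRdd = sc.textFile("hdfs://host/path/input", 428)

// Derived RDD: shrink to far fewer partitions. shuffle = false is the
// default, so decreasing the partition count avoids a shuffle.
val derived = baseRdd.map(_.length).coalesce(16)

// Growing the partition count again requires shuffle = true.
val widened = derived.coalesce(64, shuffle = true)

With shuffle = false, coalesce only merges existing partitions, so it can never produce more partitions than it started with; that's why the shuffle flag is required when going up.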
See: http://spark.incubator.apache.org/docs/latest/api/core/org/apache/spark/rdd/RDD.html#coalesce(Int,Boolean):RDD[T]

On Thu, Jan 23, 2014 at 1:18 PM, Manoj Samel <[email protected]> wrote:
> Hi,
>
> On some RDD actions, I noticed ~500 tasks being executed. In the task
> details, most of the tasks were too small IMO, and the task
> startup/shutdown/coordination overhead may be coming into the picture.
> The task durations are:
>
> Min: 5 ms
> 25th %ile: 9 ms
> Median: 10 ms
> 75th %ile: 13 ms
> Max: 40 ms
>
> The number of partitions is 428 for many RDDs built on top of each
> other. The base RDD could benefit from a large number of partitions,
> but the RDDs derived from it should have far fewer.
>
> How do I control the number of partitions at the RDD level?
