The number of splits can be configured when reading the file, as an argument to textFile(), sequenceFile(), etc. (see the SparkContext docs: http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext@textFile(String,Int):RDD[String]). Note, however, that this value is only a minimum: some input sources cap the partition size, e.g. reading from HDFS limits each partition to roughly one HDFS block (typically 64 MB or 128 MB, depending on the block size configured for the file).
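For example, something like this (a rough sketch only; it assumes an existing SparkContext named sc, an illustrative HDFS path, and an arbitrary minimum of 64 splits):

    // ask for at least 64 splits when reading the file
    val lines = sc.textFile("hdfs:///data/input.txt", 64)
    // the actual count may be higher if HDFS block boundaries force more splits
    println(lines.partitions.size)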
If you wish to have fewer partitions than the minimum your input source allows, you can use the RDD.coalesce() method (http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD) to locally combine partitions; a quick sketch follows after the quoted message below.

On Sun, Nov 17, 2013 at 1:33 PM, Umar Javed <[email protected]> wrote:
> Hi,
>
> When running Spark in the standalone cluster mode, is there a way to
> configure the number of splits for the input file(s)? It seems like it is
> approximately 32 MB for every core by default. Is that correct? For example
> in my cluster there are two workers, each running on a machine with two
> cores. For an input file of size 500 MB, Spark schedules 16 tasks for the
> initial map (500/32 ~ 16)
>
> thanks!
> Umar
>
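The coalesce() sketch mentioned above (again assuming a SparkContext named sc and a hypothetical path; the target of 4 partitions is only for illustration):

    val rdd = sc.textFile("hdfs:///data/input.txt")
    // locally combine partitions down to 4; no shuffle is performed by default
    val smaller = rdd.coalesce(4)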
