The number of splits can be configured when reading the file, as an
argument to textFile(), sequenceFile(), etc. (see the
docs<http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext@textFile(String,Int):RDD[String]>).
Note, however, that this is only a minimum, as certain input sources do
not allow partitions larger than a certain size (e.g., reading from HDFS
may cap partitions at ~130 MB, depending on the HDFS block size).
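For example, something along these lines (untested sketch; the path and
the 64 are just placeholders, and sc is your SparkContext):

  // Ask for at least 64 splits when reading the file.
  val lines = sc.textFile("hdfs:///data/input.txt", 64)
  // HDFS block boundaries may still force *more* (never fewer) partitions.
  println(lines.partitions.length)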

If you wish to end up with fewer partitions than the minimum your input
source forces, you can use the
RDD.coalesce()<http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD> method
to locally combine partitions.
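Rough sketch (again untested; the path and the target count of 4 are
arbitrary):

  // Reading a large HDFS file may produce many block-sized partitions...
  val lines = sc.textFile("hdfs:///data/big.txt")
  // ...which coalesce() merges locally, without a shuffle.
  val fewer = lines.coalesce(4)
  println(fewer.partitions.length)  // 4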


On Sun, Nov 17, 2013 at 1:33 PM, Umar Javed <[email protected]> wrote:

> Hi,
>
> When running Spark in the standalone cluster node, is there a way to
> configure the number of splits for the input file(s)? It seems like it is
> approximately 32 MB for every core by default. Is that correct? For example,
> in my cluster there are two workers, each running on a machine with two
> cores. For an input file of size 500MB, Spark schedules 16 tasks for the
> initial map (500/32 ~ 16)
>
> thanks!
> Umar
>