Hi. Actually, you can set the partition number yourself by changing the 'spark.default.parallelism' property. Otherwise, Spark will use the default, defaultParallelism.

For local mode, defaultParallelism = totalCores. For local cluster mode, defaultParallelism = math.max(totalCores, 2). In addition, for hadoopFile the default minimum number of partitions is different:

def defaultMinSplits: Int = math.min(defaultParallelism, 2)

2014-04-16 5:54 GMT+08:00 Nicholas Chammas <nicholas.cham...@gmail.com>:

> Looking at the Python version of textFile()
> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile>,
> shouldn't it be "*max*(self.defaultParallelism, 2)"?
>
> If the default parallelism is, say, 4, wouldn't we want to use that for
> minSplits instead of 2?
>
> On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Yup, one reason it's 2 actually is to give people a similar experience to
>> working with large files, in case their code doesn't deal well with the
>> file being partitioned.
>>
>> Matei
>>
>> On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> Take a look at the minSplits argument for SparkContext#textFile [1] --
>> the default value is 2. You can simply set this to 1 if you'd prefer not to
>> split your data.
>>
>> [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>>
>> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>
>>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.
>>>
>>> Given the size, and that it is a single file, I assumed it would only be
>>> in a single partition. But when I cache it, I can see in the Spark App UI
>>> that it actually splits it into two partitions:
>>>
>>> <sparkdev_2014-04-11.png>
>>>
>>> Is this correct behavior? How does Spark decide how big a partition
>>> should be, or how many partitions to create for an RDD?
>>>
>>> If it matters, I have only a single worker in my "cluster", so both
>>> partitions are stored on the same worker.
>>>
>>> The file was on HDFS and was only a single block.
>>>
>>> Thanks for any insight.
>>>
>>> Diana
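To make the min-vs-max question above concrete, here is a minimal sketch in plain Python (not actual Spark or PySpark code; the function names are hypothetical) of the two behaviors being compared, with `default_parallelism` standing in for sc.defaultParallelism:

```python
def default_min_splits(default_parallelism):
    # Current Spark behavior: min(defaultParallelism, 2). The default
    # number of splits is capped at 2, so small files are still split
    # (per Matei's point), but never forced any higher.
    return min(default_parallelism, 2)

def proposed_max_variant(default_parallelism):
    # Nicholas's suggestion: max(defaultParallelism, 2) would instead
    # split even a tiny file across all default slots.
    return max(default_parallelism, 2)

# With defaultParallelism = 4 (e.g. a 4-core machine in local mode):
print(default_min_splits(4))    # -> 2 (current behavior)
print(proposed_max_variant(4))  # -> 4
# With defaultParallelism = 1, min() would give a single split:
print(default_min_splits(1))    # -> 1
```

So with min(), Diana's 23-line file still lands in 2 partitions regardless of cores, which matches what she saw in the UI; passing minSplits=1 to textFile, as Aaron suggests, overrides that default.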