Re: partitioning of small data sets
Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data.

[1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext

On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:

> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions:
>
> [image: sparkdev_2014-04-11.png -- Spark App UI screenshot showing the two partitions]
>
> Is this correct behavior? How does Spark decide how big a partition should be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my cluster, so both partitions are stored on the same worker. The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana
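A minimal sketch of the suggestion above in Scala, assuming a local master and a hypothetical HDFS path (the second argument to textFile is minSplits in the releases current at the time of this thread; it was later renamed minPartitions):

    import org.apache.spark.{SparkConf, SparkContext}

    object TinyFileDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tiny-file-demo").setMaster("local"))

        // Ask for a minimum of 1 split so the small single-block file is
        // not divided into two partitions.
        val tiny = sc.textFile("hdfs:///path/to/tiny.txt", 1)

        // Confirm in code (or in the Spark App UI after caching) how many
        // partitions were actually created.
        println(tiny.partitions.size)

        sc.stop()
      }
    }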
Re: partitioning of small data sets
Yup, one reason it's 2 actually is to give people a similar experience to working with large files, in case their code doesn't deal well with the file being partitioned.

Matei

On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data.
>
> [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions:
>>
>> [image: sparkdev_2014-04-11.png]
>>
>> Is this correct behavior? How does Spark decide how big a partition should be, or how many partitions to create for an RDD?
>>
>> If it matters, I have only a single worker in my cluster, so both partitions are stored on the same worker. The file was on HDFS and was only a single block.
>>
>> Thanks for any insight.
>>
>> Diana
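One hypothetical illustration (not from the thread) of code that "doesn't deal well" with a split file: dropping a CSV header by skipping the first line of each partition only does what you expect when the whole file sits in a single partition.

    // Hypothetical sketch; assumes an existing SparkContext `sc`.
    // With minSplits = 1 this drops exactly the header line. If the file
    // were split across several partitions, it would silently drop one
    // line from each partition instead.
    val withoutHeader = sc.textFile("hdfs:///path/to/data.csv", 1)
      .mapPartitions(iter => iter.drop(1))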
Re: partitioning of small data sets
Looking at the Python version of textFile() (http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile), shouldn't it be *max*(self.defaultParallelism, 2)? If the default parallelism is, say, 4, wouldn't we want to use that for minSplits instead of 2?

On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yup, one reason it's 2 actually is to give people a similar experience to working with large files, in case their code doesn't deal well with the file being partitioned.
>
> Matei
>
> On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data.
>>
>> [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>>
>> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>
>>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions:
>>>
>>> [image: sparkdev_2014-04-11.png]
>>>
>>> Is this correct behavior? How does Spark decide how big a partition should be, or how many partitions to create for an RDD?
>>>
>>> If it matters, I have only a single worker in my cluster, so both partitions are stored on the same worker. The file was on HDFS and was only a single block.
>>>
>>> Thanks for any insight.
>>>
>>> Diana
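For what it's worth, the Scala SparkContext of that era appears to apply the same cap rather than a scaling default: defaultMinSplits is math.min(defaultParallelism, 2), i.e. a small lower bound on splits rather than a target. A hedged sketch (hypothetical path, existing SparkContext `sc` assumed) of explicitly requesting defaultParallelism splits when that is what you want:

    // minSplits is only a hint to the Hadoop InputFormat, so the actual
    // partition count can differ; inspect it after loading.
    val rdd = sc.textFile("hdfs:///path/to/data.txt", sc.defaultParallelism)
    println(rdd.partitions.size)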