Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the
default value is 2. You can simply set this to 1 if you'd prefer not to
split your data.

[1]
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
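
For example, in spark-shell (the HDFS path here is just a placeholder, not
from this thread):

    // minSplits is a lower bound on the number of input splits; passing 1
    // lets a single-block file end up in one partition instead of the default 2.
    val lines = sc.textFile("hdfs:///tmp/tiny.txt", 1)
    lines.partitions.length   // 1 for a small, single-block file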


On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll dcarr...@cloudera.com wrote:

 I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB

 Given the size, and that it is a single file, I assumed it would only be
 in a single partition.  But when I cache it,  I can see in the Spark App UI
 that it actually splits it into two partitions:

 [image: sparkdev_2014-04-11.png -- Spark App UI showing the cached RDD in two partitions]

 Is this correct behavior?  How does Spark decide how big a partition
 should be, or how many partitions to create for an RDD?

 If it matters, I have only a single worker in my cluster, so both
 partitions are stored on the same worker.

 The file was on HDFS and was only a single block.

 Thanks for any insight.

 Diana




Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 actually is to give people a similar experience to 
working with large files, in case their code doesn’t deal well with the file 
being partitioned.
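
For example (a hypothetical sketch, not from anyone's actual code), dropping a
"header" line with mapPartitions quietly drops one line per partition rather
than one per file:

    val rdd = sc.textFile("hdfs:///tmp/tiny.txt")   // default minSplits = 2
    // Meant to skip a single header line, but with 2 partitions this drops
    // the first line of each partition, not just the first line of the file.
    val noHeader = rdd.mapPartitions(iter => iter.drop(1))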

Matei

Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile() [1], shouldn't it be
*max*(self.defaultParallelism, 2)?

If the default parallelism is, say, 4, wouldn't we want to use that for
minSplits instead of 2?

[1]
http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile
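
In the meantime it can of course be passed explicitly -- e.g., in spark-shell
(the path is just illustrative):

    // Ask for at least defaultParallelism input splits instead of the default minimum of 2.
    val rdd = sc.textFile("hdfs:///data/some-file.txt", sc.defaultParallelism)
    rdd.partitions.length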

